Microprocessors load/store functional units and data caches.



(57) A load/store functional unit and a corresponding
data cache of a superscalar microprocessor are
disclosed. The load/store functional unit includes a
plurality of reservation station entries which are
accessed in parallel and which are coupled to the data
cache in parallel. The load/store functional unit also
includes a store buffer circuit having a plurality of
store buffer entries. The store buffer entries are
organized to provide a first in first out buffer where
the outputs from less significant entries of the buffer
are provided as inputs to more significant entries of
the buffer.

The present invention relates to microprocessors, and,
more particularly, to providing microprocessors with
high performance data caches and load/store functional
units.

Microprocessors are processors which are imple- 
mented on one or a very small number of semicon- 
ductor chips. Semiconductor chip technology is ever 
increasing the circuit densities and speeds within mi- 
croprocessors; however, the interconnection be- 
tween the microprocessor and external memory is 
constrained by packaging technology. Though on- 
chip interconnections are extremely cheap, off-chip 
connections are very expensive. Any technique in- 
tended to improve microprocessor performance must 
take advantage of increasing circuit densities and 
speeds while remaining within the constraints of 
packaging technology and the physical separation 
between the processor and its external memory. 
While increasing circuit densities provide a path to ev- 
ermore complex designs, the operation of the micro- 
processor must remain simple and clear for users to 
understand how to use the microprocessor. 

While the majority of existing microprocessors are
targeted toward scalar computation, superscalar
microprocessors are the next logical step in the
evolution of microprocessors. The term superscalar
describes a computer implementation that improves
performance by a concurrent execution of scalar
instructions. Scalar instructions are the type of
instructions typically found in general purpose
microprocessors. Using today's semiconductor processing
technology, a single processor chip can incorporate high
performance techniques that were once applicable only to
large-scale scientific processors. However, many of the
techniques applied to large-scale processors are either
inappropriate for scalar computation or too expensive to
be applied to microprocessors.

A microprocessor runs application programs. An
application program comprises a group of instructions.
In running the application program, the processor
fetches and executes the instructions in some sequence.
There are several steps involved in executing even a
single instruction, including fetching the instruction,
decoding it, assembling its operands, performing the
operations specified by the instruction, and writing the
results of the instruction to storage. The execution of
instructions is controlled by a periodic clock signal.
The period of the clock signal is the processor cycle
time.

The time taken by a processor to complete a program is
determined by three factors: the number of instructions
required to execute the program; the average number of
processor cycles required to execute an instruction;
and the processor cycle time. Processor performance is
improved by reducing the time taken, which dictates
reducing one or more of these factors.
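
The three factors multiply together to give the run time. A minimal sketch of that relationship follows; the function name and the sample numbers are illustrative assumptions and are not figures from this disclosure.

```python
def execution_time(instruction_count, cycles_per_instruction, cycle_time_ns):
    """Return program run time in nanoseconds.

    time = instructions x average cycles per instruction x cycle time
    """
    return instruction_count * cycles_per_instruction * cycle_time_ns


# Hypothetical workload: 1,000,000 instructions, 2 cycles each, 10 ns cycle.
baseline = execution_time(1_000_000, 2.0, 10.0)
# Halving the average cycles per instruction halves the run time.
improved = execution_time(1_000_000, 1.0, 10.0)
```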

One way to improve performance of the microprocessor is
by overlapping the steps of different instructions,
using a technique called pipelining. To pipeline
instructions, the various steps of instruction execution
are performed by independent units called pipeline
stages. Pipeline stages are separated by clocked
registers. The steps of different instructions are
executed independently in different pipeline stages.
Pipelining reduces the average number of cycles required
to execute an instruction, though not the total amount
of time required to execute an instruction, by
permitting the processor to handle more than one
instruction at a time. This is done without increasing
the processor cycle time appreciably. Pipelining
typically reduces the average number of cycles per
instruction by as much as a factor of three. However,
when executing a branch instruction, the pipeline may
sometimes stall until the result of the branch operation
is known and the correct instruction is fetched for
execution. This delay is known as the branch-delay
penalty. Increasing the number of pipeline stages also
typically increases the branch-delay penalty relative to
the average number of cycles per instruction.

Another way to improve processor performance is to
increase the speed with which the microprocessor
assembles the operands of an instruction and writes the
results of the instruction; these functions are referred
to as a load and a store, respectively. Both of these
functions depend upon the microprocessor's use of its
data cache.

During the development of early microprocessors,
instructions took a long time to fetch compared to the
execution time. This motivated the development of
complex instruction set computer (CISC) processors. CISC
processors were based on the observation that, given the
available technology, the number of cycles per
instruction was determined mostly by the number of
cycles taken to fetch the instruction. To improve
performance, the two principal goals of the CISC
architecture were to reduce the number of instructions
needed for a given task and to encode these instructions
densely. It was acceptable to accomplish these goals by
increasing the average number of cycles taken to decode
and execute an instruction because, using pipelining,
the decode and execution cycles could be mostly
overlapped with a relatively lengthy instruction fetch.
With this set of assumptions, CISC processors evolved
densely encoded instructions at the expense of decode
and execution time inside the processor. Multiple-cycle
instructions reduced the overall number of instructions
and thus reduced the overall execution time because they
reduced the instruction fetch time.

In the late 1970's and early 1980's, memory and
packaging technology changed rapidly. Memory densities
and speed increased to the point where high speed local
memories called caches could be implemented near the
processor. Caches are used by the processor to
temporarily store instructions and data. When
instructions are fetched more quickly using caches, the
performance is limited by the decode and execution time
that was previously hidden within the instruction fetch
time. The number of instructions does not affect
performance as much as the average number of cycles
taken to execute an instruction.

The improvement in memory and packaging technology, to
the point where instruction fetching did not take much
longer than instruction execution, motivated the
development of reduced instruction set computer (RISC)
processors. To improve performance, the principal goal
of a RISC architecture is to reduce the number of cycles
taken to execute an instruction, allowing some increase
in the total number of instructions. The trade-off
between cycles per instruction and the number of
instructions is not one to one. Compared to CISC
processors, RISC processors typically reduce the number
of cycles per instruction by factors of three to five,
while they typically increase the number of instructions
by thirty to fifty percent. RISC processors rely on
auxiliary features such as a large number of general
purpose registers, and instruction and data caches to
help the compiler reduce the overall instruction count
or to help reduce the number of cycles per instruction.
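
The trade-off can be checked with simple arithmetic. The sketch below assumes an equal processor cycle time on both sides, which the text does not state, and uses a hypothetical helper name; it only combines the ranges quoted above.

```python
def relative_speedup(cpi_reduction_factor, instruction_growth):
    """Speedup at equal cycle time (an assumption of this sketch).

    cpi_reduction_factor: factor by which cycles per instruction shrink
    instruction_growth:   fractional increase in instruction count
    """
    return cpi_reduction_factor / (1.0 + instruction_growth)


# Least favourable pairing: 3x fewer cycles, 50% more instructions.
worst = relative_speedup(3.0, 0.50)
# Most favourable pairing: 5x fewer cycles, 30% more instructions.
best = relative_speedup(5.0, 0.30)
```

Under that assumption the net gain stays well above one even in the least favourable pairing, which is why the trade-off is described as not one to one.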

A typical RISC processor executes one instruction on
every processor cycle. A superscalar processor reduces
the average number of cycles per instruction beyond what
is possible in a pipelined scalar RISC processor by
allowing concurrent execution of instructions in the
same pipeline stage as well as concurrent execution of
instructions in different pipeline stages. The term
superscalar emphasizes multiple concurrent operations on
scalar quantities as distinguished from multiple
concurrent operations on vectors or arrays as is common
in scientific computing.

While superscalar processors are conceptually simple,
there is more to achieving increased performance than
widening a processor's pipeline. Widening the pipeline
makes it possible to execute more than one instruction
per cycle but there is no guarantee that any given
sequence of instructions can take advantage of this
capability. Instructions are not independent of one
another but are interrelated; these interrelationships
prevent some instructions from occupying the same
pipeline stage. Furthermore, the processor's mechanisms
for decoding and executing instructions can make a big
difference in its ability to discover instructions that
can be executed simultaneously.

Superscalar techniques largely concern the processor
organization independent of the instruction set and
other architectural features. Thus, one of the
attractions of superscalar techniques is the possibility
of developing a processor that is code compatible with
an existing architecture. Many superscalar techniques
apply equally well to either RISC or CISC architectures.
However, because of the regularity of many of the RISC
architectures, superscalar techniques have initially
been applied to RISC processor designs.

The attributes of the instruction set of a RISC
processor that lend themselves to single cycle decoding
also lend themselves well to decoding multiple RISC
instructions in the same clock cycle. These include a
general three operand load/store architecture,
instructions having only a few instruction lengths,
instructions utilizing only a few addressing modes,
instructions which operate on fixed-width registers and
register identifiers in only a few places within the
instruction format. Techniques for designing a
superscalar RISC processor are described in Superscalar
Microprocessor Design, by William Michael Johnson, 1991,
Prentice-Hall, Inc. (a division of Simon & Schuster),
Englewood Cliffs, New Jersey.

In contrast to RISC architectures, CISC architectures
use a large number of different instruction formats. One
CISC microprocessor architecture which has gained
wide-spread acceptance is the x86 architecture. This
architecture, first introduced in the i386™
microprocessor, is also the basic architecture of both
the i486™ microprocessor and the Pentium™
microprocessor, all available from the Intel Corporation
of Santa Clara, California. The x86 architecture
provides for three distinct types of addresses: a
logical address, a linear address and a physical
address.

The logical address represents an offset from a segment
base address. The offset, referred to as the effective
address, is based upon the type of addressing mode that
the microprocessor is using. These addressing modes
provide different combinations of four address elements:
a displacement, a base, an index and a scale. The
segment base address is accessed via a selector. More
specifically, the selector, which is stored in a segment
register, is an index which points to a location in a
global descriptor table (GDT). The GDT location stores
the linear address corresponding to the segment base
address.

The translation between logical and linear addresses
depends on whether the microprocessor is in Real Mode or
Protected Mode. When the microprocessor is in Real Mode,
then a segmentation unit shifts the selector left four
bits and adds the result to the offset to form the
linear address. When the microprocessor is in Protected
Mode, then the segmentation unit adds the linear base
address pointed to by the selector to the offset to
provide the linear address.
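
The address formation described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the descriptor table is modelled as a plain dictionary from selector to base address (ignoring selector sub-fields, limits and protection checks), and masking results to 32 bits is an assumption of the sketch.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base=0, index=0, scale=1, displacement=0):
    """Combine the four address elements into the offset (effective address)."""
    return (base + index * scale + displacement) & MASK32


def linear_real_mode(selector, offset):
    """Real Mode: selector shifted left four bits, plus the offset."""
    return ((selector << 4) + offset) & MASK32


def linear_protected_mode(selector, offset, descriptor_table):
    """Protected Mode: base address found via the selector, plus the offset."""
    base = descriptor_table[selector]  # hypothetical stand-in for the GDT lookup
    return (base + offset) & MASK32
```

For instance, in Real Mode a selector of 0x1234 and an offset of 0x0010 give the linear address 0x12350.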

The physical address is the address which appears on the
address pins of the microprocessor and is used to
physically address external memory. The physical address
does not necessarily correspond to the linear address.
If paging is not enabled then the 32-bit linear address
corresponds to the physical address. If paging is
enabled, then the linear address must be translated into
the physical address. A paging unit performs this
translation.

The paging unit uses two levels of tables to translate
the linear address into a physical address. The first
level table is a Page Directory and the second level
table is a Page Table. The Page Directory includes a
plurality of page directory entries; each entry includes
the address of a Page Table and information about the
Page Table. The upper 10 bits of the linear address
(A22 - A31) are used as an index to select a Page
Directory Entry. The Page Table includes a plurality of
Page Table entries; each Page Table entry includes a
starting address of a page frame, referred to as the
real page number of the page frame, and statistical
information about the page. Address bits A12 - A21 of
the linear address are used as an index to select one of
the Page Table entries. The starting address of the page
frame is concatenated with the lower 12 bits of the
linear address to form the physical address.
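
The two-level lookup can be sketched in a few lines. The data layout here is an assumption made for illustration: the Page Directory and Page Tables are modelled as dictionaries, the names are hypothetical, and the per-entry status information is left out.

```python
INDEX_MASK = (1 << 10) - 1    # ten index bits per table level
OFFSET_MASK = (1 << 12) - 1   # lower 12 bits pass through unchanged


def translate(linear, page_directory, page_tables):
    """Walk the Page Directory, then the Page Table; return the physical address.

    page_directory: dict mapping directory index -> page table identifier
    page_tables:    dict mapping identifier -> dict(table index -> frame start)
    """
    dir_index = (linear >> 22) & INDEX_MASK     # bits A22 - A31
    table_index = (linear >> 12) & INDEX_MASK   # bits A12 - A21
    offset = linear & OFFSET_MASK               # bits A0 - A11
    table = page_tables[page_directory[dir_index]]
    frame_start = table[table_index]            # 4 Kbyte aligned frame address
    return frame_start | offset
```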

Because accessing two levels of table for every memory
operation substantially affects performance of the
microprocessor, the x86 architecture provides a cache of
the most recently accessed page table entries; this
cache is called a translation lookaside buffer (TLB).
The microprocessor only uses the paging unit when an
entry is not in the TLB.
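
A minimal sketch of that behaviour, assuming an unbounded cache with no replacement policy (a real TLB holds a fixed number of entries). The class name and the `walk` callback, which stands in for the paging unit, are hypothetical.

```python
class TLB:
    """Tiny translation cache keyed by the upper 20 bits of the linear address."""

    def __init__(self, walk):
        self.walk = walk      # stand-in for the paging unit: page number -> frame start
        self.entries = {}
        self.misses = 0

    def translate(self, linear):
        page, offset = linear >> 12, linear & 0xFFF
        if page not in self.entries:          # miss: fall back to the paging unit
            self.misses += 1
            self.entries[page] = self.walk(page)
        return self.entries[page] | offset
```

Two accesses to the same page therefore invoke the table walk only once.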

The first processor conforming to the x86 architecture
which included a cache was the 486 processor, which
included an 8 Kbyte unified cache. The Pentium™
processor includes separate 8 Kbyte instruction and data
caches. The 486 processor cache and the Pentium™
processor caches are accessed via physical addresses;
however, the functional units of these processors
operate with logical addresses. Accordingly, when the
functional units require access to the cache, the
logical address must be converted to a linear address
and then to a physical address.

It has been discovered that by providing a
microprocessor with a load section which includes a
plurality of reservation station entries which are
accessed in parallel, it is possible to perform a
plurality of load operations in parallel.

It has also been discovered that by providing a
microprocessor with a store section which includes a
plurality of store buffer entries which are organized as
a first in first out buffer where the outputs from less
significant entries of the buffer are provided as inputs
to more significant entries of the buffer, it is
possible to perform store forwarding operations.

In the accompanying drawings, by way of example only:

Fig. 1 is a block diagram of a superscalar
microprocessor in accordance with the present invention.

Fig. 2 is a block diagram of a load/store functional
unit and data cache in accordance with the present
invention.

Fig. 3 is a block diagram of a reservation station
circuit of the Fig. 2 load/store functional unit.

Fig. 4 is a block diagram of the contents of an entry in
the Fig. 3 reservation station circuit.

Fig. 5 is a block diagram of an adder circuit of the
Fig. 3 reservation station circuit.

Fig. 6 is a block diagram of a store buffer circuit of
the Fig. 2 load/store functional unit.

Fig. 7 is a block diagram of the contents of an entry of
the Fig. 6 store buffer circuit.

Fig. 8 is a block diagram of a store buffer entry of the
Fig. 6 store buffer circuit.

Fig. 9 is a block diagram of an entry of the Fig. 2 data
cache.

Fig. 10 is a block diagram of a store array and a linear
tag array of the Fig. 2 data cache.

Figs. 11 and 12 are block diagrams of the banking
structure of the Fig. 10 store array.

Fig. 13 is a timing diagram of a load operation in
accordance with the present invention.

Fig. 14 is a timing diagram of a store operation in
accordance with the present invention.

Fig. 15 is a timing diagram of a data cache miss during
a speculative access operation in accordance with the
present invention.

Fig. 16 is a timing diagram of a data cache reload
operation in accordance with the present invention.

Fig. 17 is a timing diagram of a misaligned access
operation in accordance with the present invention.

Referring to Fig. 1, the present embodiment can be best
understood in the context of superscalar x86
microprocessor 100 which executes the x86 instruction
set. Microprocessor 100 is coupled to physically
addressed external memory 101 via a 486 XL bus or other
conventional microprocessor bus. Microprocessor 100
includes instruction cache 104 which is coupled to byte
queue 106 which is coupled to instruction decoder 108.
Instruction decoder 108 is coupled to RISC core 110.
RISC core 110 includes register file 112 and reorder
buffer 114 as well as a variety of functional units such
as shift unit 130 (SHF), arithmetic logic units 131, 132
(ALU0 and ALU1), special register block 133 (SRB),
load/store unit 134 (LSSEC), branch section 135
(BRNSEC), and floating point unit 136 (FPU).

RISC core 110 includes A and B operand buses 116, type
and dispatch (TAD) bus 118 and result bus 140 which are
coupled to the functional units as well as displacement
and instruction, load store (INLS) bus 119 which is
coupled between instruction decoder 108 and load/store
unit 134. A and B operand buses 116 are also coupled to
register file 112 and reorder buffer 114. TAD bus 118 is
also coupled to instruction decoder 108. Result bus 140
is also coupled to reorder buffer 114. Additionally,
branch section 135 is coupled to reorder buffer 114,
instruction decoder 108 and instruction cache 104 via
Xtarget bus 103. A and B operand buses 116 include four
parallel 41-bit wide A operand buses and four parallel
41-bit wide B operand buses as well as four parallel
12-bit wide A tag buses, four parallel 12-bit wide B tag
buses, a 12-bit wide A tag valid bus, a 12-bit wide B
tag valid bus, four 4-bit wide destination tag buses and
four 8-bit wide opcode buses. Type and dispatch bus 118
includes four 3-bit wide type code buses and one 4-bit
wide dispatch bus. Displacement and INLS bus 119
includes two 32-bit wide displacement buses and two
8-bit wide INLS buses.

In addition to instruction cache 104, microprocessor 100
also includes data cache 150 (DCACHE) and physical tag
circuit 162. Data cache 150 is coupled to load/store
functional unit 134 of the RISC core and with
intraprocessor address and data (IAD) bus 102.
Instruction cache 104 is also coupled with IAD bus 102.
Physical tag circuit 162 interacts with both instruction
cache 104 and data cache 150 via the IAD bus.
Instruction cache 104 and data cache 150 are both
linearly addressable caches. Instruction cache 104 and
data cache 150 are physically separate; however, both
caches are organized using the same architecture.

Microprocessor 100 also includes memory management unit
(MMU) 164 and bus interface unit 160 (BIU). TLB 164 is
coupled with the IAD bus and physical translation
circuit 162. Bus interface unit 160 is coupled to
physical translation circuit 162, data cache 150 and IAD
bus 102 as well as an external microprocessor bus such
as the 486 XL bus.

Microprocessor 100 executes computer programs which
include sequences of instructions. Computer programs are
typically stored on a hard disk, floppy disk or other
non-volatile storage media which are located in the
computer system. When the program is run, the program is
loaded from the storage media into main memory 101. Once
the instructions of the program and associated data are
in main memory 101, individual instructions are prepared
for execution and ultimately executed by microprocessor
100.

After being stored in main memory 101, the instructions
are passed via bus interface unit 160 to instruction
cache 104, where the instructions are temporarily held.
Instruction decoder 108 retrieves the instructions from
instruction cache 104, examines the instructions and
determines the appropriate action to take. For example,
decoder 108 may determine whether a particular
instruction is a PUSH, POP, LOAD, STORE, AND, OR, EX OR,
ADD, SUB, NOP, JUMP, JUMP on condition (BRANCH) or other
instruction. Depending on which particular instruction
decoder 108 determines is present, the instruction is
dispatched to the appropriate functional unit of RISC
core 110. LOADs and STOREs are the primary two
instructions which are dispatched to load store section
134. Other instructions which are executed by load/store
functional unit 134 include PUSH and POP.

The instructions typically include multiple fields in
the following format: OP CODE, OPERAND A, OPERAND B and
DESTINATION. For example, the instruction ADD A, B, C
means add the contents of register A to the contents of
register B and place the result in register C. LOAD and
STORE operations use a slightly different format. For
example, the instruction LOAD A, B, C means place data
retrieved from an address on the result bus, where A, B
and C represent address components which are located on
the A operand bus, the B operand bus and the
displacement bus; these address components are combined
to provide a logical address which is combined with a
segment base to provide the linear address from which
the data is retrieved. Also for example, the instruction
STORE A, B, C means store data in a location pointed to
by an address, where A is the store data located on the
A operand bus and B and C represent address components
which are located on the B operand bus and the
displacement bus; these address components are combined
to form a logical address which is combined with a
segment base to provide the linear address to which the
data is stored.

The OP CODES are provided from instruction decoder 108
to the functional units of RISC core 110 via opcode bus.
Not only must the OP CODE for a particular instruction
be provided to the appropriate functional unit, but also
the designated OPERANDS for the instruction must be
retrieved and sent to the functional unit. If the value
of a particular operand has not yet been calculated,
then that value must be first calculated and provided to
the functional unit before the functional unit can
execute the instruction. For example, if a current
instruction is dependent on a prior instruction, the
result of the prior instruction must be determined
before the current instruction can be executed. This
situation is referred to as a dependency.

The operands which are needed for a particular 
instruction to be executed by a functional unit are pro- 
vided by either register file 112 or reorder buffer 114 
to the operand bus. The operand bus conveys the op- 
erands to the appropriate functional units. Once a 
functional unit receives the OP CODE, OPERAND A, 
and OPERAND B, the functional unit executes the in- 
struction and places the result on a result bus 140, 
which is coupled to the outputs of all of the functional 
units and to reorder buffer 114. 

Reorder buffer 114 is managed as a first in first out
(FIFO) device. When an instruction is decoded by
instruction decoder 108, a corresponding entry is
allocated in reorder buffer 114. The result value
computed by the instruction is then written into the
allocated entry when the execution of the instruction is
completed. The result value is subsequently written into
register file 112 and the instruction retired if there
are no exceptions associated with the instruction and if
no speculative branch is pending which affects the
instruction. If the instruction is not complete when its
associated entry reaches the head of the reorder buffer
114, the advancement of reorder buffer 114 is halted
until the instruction is completed. Additional entries,
however, can continue to be allocated.
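
The in-order retirement behaviour can be sketched as a simple queue. This is a simplified model under stated assumptions: the class and method names are hypothetical, entries are plain dictionaries, and exception handling and pending speculative branches are not modelled.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest entry sits at the left (the head)

    def allocate(self, dest):
        """At decode: reserve an entry for the instruction's result."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """When execution finishes: record the result in the allocated entry."""
        entry["value"], entry["done"] = value, True

    def retire(self, register_file):
        """Write back from the head; halt at the first incomplete entry."""
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            register_file[entry["dest"]] = entry["value"]
```

A younger instruction that finishes first stays in the buffer until every older entry ahead of it has completed.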

Each functional unit includes respective reservation
station circuits (RS) 120 - 126 for storing OP CODES
from instructions which are not yet complete because
operands for that instruction are not yet available to
the functional unit. Each reservation station circuit
stores the instruction's OP CODE together with tags
which reserve places for the missing operands that will
arrive at the reservation station circuit later. This
technique enhances performance by permitting
microprocessor 100 to continue executing other
instructions while the pending instruction is being
assembled with its operands at the reservation station.
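
The tag mechanism can be illustrated with a small model. The representation is an assumption of this sketch: each operand is a pair, either ("value", v) when known or ("tag", t) when still pending, and the class and method names are hypothetical.

```python
class ReservationEntry:
    def __init__(self, opcode, operands):
        # Each operand is ("value", v) when known or ("tag", t) when pending.
        self.opcode = opcode
        self.operands = list(operands)

    def capture(self, tag, value):
        """Replace any operand waiting on `tag` with the value now available."""
        self.operands = [
            ("value", value) if op == ("tag", tag) else op
            for op in self.operands
        ]

    def ready(self):
        """The instruction can execute once no operand is a tag placeholder."""
        return all(kind == "value" for kind, _ in self.operands)
```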

Microprocessor 100 affords out of order issue by
isolating decoder 108 from the functional units of RISC
core 110. More specifically, reorder buffer 114 and the
reservation stations of the functional units effectively
establish a distributed instruction window. Accordingly,
decoder 108 can continue to decode instructions even if
the instructions can not be immediately executed. The
instruction window acts as a pool of instructions from
which the functional units draw as they continue to go
forward and execute instructions. The instruction window
thus provides microprocessor 100 with a look ahead
capability. When dependencies are cleared and as
operands become available, more instructions in the
window are executed by the functional units and the
decoder continues to fill the window with yet more
decoded instructions.

Microprocessor 100 uses branch section 135 of the RISC
core to enhance its performance. Because the next
instruction depends upon the result of a branch when one
occurs, branches in the instruction stream of a program
hinder the capability of the microprocessor to fetch
instructions. Branch section 135 predicts the outcomes
of branches which occur during instruction fetching,
i.e., branch section 135 predicts whether branches
should be taken. For example, a branch target buffer is
employed to keep a running history of the outcomes of
prior branches. Based on this history, a decision is
made during a particular fetched branch to determine
which branch the fetched branch instruction will take.
If there is an exception or branch misprediction, then
the contents of reorder buffer 114 allocated subsequent
to the mispredicted branch instruction are discarded.
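
The text does not say how the history is turned into a prediction. As one hedged illustration, the sketch below uses a last-outcome scheme, in which a branch is predicted to repeat its previous outcome; that scheme, the default for unseen branches, and all names are assumptions of the sketch. The second helper shows the discard step on a plain list standing in for the reorder buffer.

```python
class BranchHistory:
    """Last-outcome predictor (an assumed scheme, chosen for brevity)."""

    def __init__(self):
        self.history = {}   # branch address -> True if last taken

    def predict(self, branch_address):
        # Assumption: a branch never seen before is predicted not taken.
        return self.history.get(branch_address, False)

    def update(self, branch_address, taken):
        self.history[branch_address] = taken


def discard_after(reorder_entries, mispredicted_position):
    """Keep entries up to and including the mispredicted branch; drop the rest."""
    return reorder_entries[:mispredicted_position + 1]
```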

Referring to Fig. 2, load/store functional unit 134 is
the functional unit which interacts with data cache 150
and executes all LOAD instructions and all STORE
instructions. Load/store functional unit 134 includes
reservation station circuit 124, store buffer circuit
180 and load store controller 182. Reservation station
circuit 124 includes four reservation station entries
(RS0 - RS3) and store buffer circuit 180 includes four
store buffer entries (SB0 - SB3).



Reservation station circuit 124 holds all of the fields
which are required to perform a load operation or a
store operation. Data elements can be issued to two
reservation station entries per clock cycle and can be
retired from two reservation station entries per clock
cycle. Reservation station circuit 124 is coupled to the
four result buses, 40 bits of the four 41-bit A operand
buses, 32 bits of the four 41-bit B operand buses, the A
and B tag valid buses, the four A tag buses, the four B
tag buses, the four destination tag buses, the four type
code buses, the two displacement buses and the two INLS
buses as well as to the 32-bit data portions of ports A
and B of data cache 150. Reservation station circuit 124
is coupled to store buffer circuit 180 via a 40-bit A
operand bus, a 32-bit reservation station data bus
(RDATAA, RDATAB, respectively), a 12-bit A tag bus
(TAG A) and a 12-bit B tag bus (TAG B) as well as two
32-bit address buses (ADDR A, ADDR B); the two address
buses are also coupled to the address portions of ports
A and B of data cache 150. Reservation station 124 is
coupled to controller 182 via a reservation station load
bus and a reservation station shift bus.

In addition to being coupled to reservation station
circuit 124, store buffer circuit 180 is coupled with
the four result buses and is also coupled with load
store controller 182 via store buffer load bus and store
buffer shift bus. Store buffer circuit 180 is also
coupled to IAD bus 102.

In addition to being coupled to reservation station
circuit 124 and store buffer circuit 180, load store
controller 182 is coupled to physical tag circuit 162
and reorder buffer 114. Controller 182 is also coupled
to cache controller 190 of data cache 150.

Data cache 150 is a linearly addressed 4-way
interleaved, 8 Kbyte 4-way set associative cache which
supports two accesses per clock cycle, i.e., data cache
150 supports dual execution. Each set of data cache 150
includes 128 entries; each entry includes a sixteen byte
block of information. Each sixteen byte block of
information is stored in a line of four individually
addressable 32-bit banks. By providing data cache 150
with individually addressable banks, data cache 150
functions as a dually accessible data cache without
requiring the overhead associated with providing dual
porting. Data cache 150 is dually accessible via data
cache port A and data cache port B thus allowing data
cache 150 to perform two load operations simultaneously.
Data cache port A includes a data portion, DATA A, and
an address portion, ADDR A; data cache port B includes a
data portion, DATA B, and an address portion, ADDR B.
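
The stated geometry can be checked with a short calculation: 8 Kbytes divided across four sets of sixteen-byte blocks gives the 128 entries per set quoted above. The sketch also splits a linear address into fields; the particular bit assignment (low bits for byte and bank, next seven bits for the entry index, remainder as tag) and the rule that two accesses proceed together only when they target different banks are assumptions of the sketch, not statements from this passage.

```python
CACHE_BYTES = 8 * 1024
WAYS = 4
LINE_BYTES = 16
BANK_BYTES = 4                                        # one 32-bit bank
ENTRIES_PER_WAY = CACHE_BYTES // (WAYS * LINE_BYTES)  # 128


def split_address(linear):
    """Split a 32-bit linear address into (tag, index, bank, byte)."""
    byte = linear % BANK_BYTES
    bank = (linear // BANK_BYTES) % (LINE_BYTES // BANK_BYTES)
    index = (linear // LINE_BYTES) % ENTRIES_PER_WAY
    tag = linear // (LINE_BYTES * ENTRIES_PER_WAY)
    return tag, index, bank, byte


def can_access_together(addr_a, addr_b):
    """Assumed rule: ports A and B proceed together when the banks differ."""
    return split_address(addr_a)[2] != split_address(addr_b)[2]
```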

Data cache 150 includes data cache controller 190 and
data cache array 192. Data cache controller 190 provides
control signals to orchestrate the various operations of
data cache 150. Data cache array 192 stores data under
the control of data cache controller 190. Data cache
array 192 is organized into two arrays, data store array
200 and linear tag and status array 202. Data store
array 200 provides two data signals, DATA A, DATA B, to
load/store functional unit 134. Linear tag array 202
receives two linear addresses, ADDR A, ADDR B, which are
provided by load/store functional unit 134 and provides
two 4-bit tag hit signals, COL HIT A 0 - 3, COL HIT B
0 - 3, to data store array 200. Linear addresses ADDR A
and ADDR B are also provided to data store array 200.

During a load operation, reservation station cir- 
cuit 124 of load store functional unit 134 provides an 
address to data cache 150; if this address generates 
a cache hit, then data cache 150 provides the data 
which is stored in the corresponding bank and block 
of store array 200 to reservation station circuit 124. If 
the address was provided to data cache 150 via port 
A, then the data is provided to reservation station cir- 
cuit 124 via port A; alternately, if the address was pro- 
vided to data cache 150 via port B, then the data is 
provided to reservation station circuit 124 via port B. 
Addresses are provided to data cache 150 and data
is received from data cache 150 via port A and port B
simultaneously.

During a store operation, the store data is provided
from reservation station circuit 124 to store buffer
circuit 180. When the store operation is released, the
data being stored, and the corresponding address, are
provided via IAD bus 102 to data cache 150.

Referring to Fig. 3, reservation station circuit 124 
is a dual access reservation station which functions 
as a first in first out (FIFO) buffer. Reservation station 
circuit 124 includes input 0 multiplexer circuit 206, in- 
put 1 multiplexer circuit 208 and four reservation sta- 
tion entries RS0 210, RS1 211, RS2 212 and RS3 213 
as well as reservation station 0 adder circuit 216, res- 
ervation station 1 adder circuit 218 and reservation 
station driver circuit 220. 

Multiplexer circuits 206, 208 receive as inputs the 
four A operand buses, the four B operand buses, the 
A and B tag valid buses, the four A tag buses, the four 
B tag buses, the four destination tag buses, the four 
opcode buses, the two INLS buses and the two dis- 
placement buses. Multiplexer circuits 206, 208 also 
receive bus select signals from load store controller 
182. The bus select signals are generated based 
upon type code matches. 

A type code match occurs when a type code on 
one of the four type code buses corresponds to the 
type code assigned to the load store functional unit. 
When a type code matches, load store controller 182
generates a bus select signal indicating from which
bus information should be retrieved. Reservation station
circuit 124 can retrieve signals from two buses
simultaneously. Accordingly, first and second sets of
bus select signals are generated by load store con- 
troller 182 for input 0 multiplexer 206 and input 1 mul- 
tiplexer 208, respectively. 
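
The selection of up to two dispatch positions by type code match can
be sketched as follows. This is only an illustration: the numeric
type code LSU_TYPE, the function name bus_selects and the policy of
giving the lowest matching dispatch position to INPUT 0 are
assumptions, not details stated in the text.

```python
LSU_TYPE = 0b011  # hypothetical type code assigned to the load store unit


def bus_selects(type_codes, unit_type=LSU_TYPE):
    """Return (INPUT 0 select, INPUT 1 select) as dispatch positions or None.

    type_codes holds the code seen on each of the four type code buses.
    Assumed policy: the first match feeds INPUT 0, the second feeds INPUT 1.
    """
    matches = [pos for pos, code in enumerate(type_codes) if code == unit_type]
    first = matches[0] if len(matches) > 0 else None
    second = matches[1] if len(matches) > 1 else None
    return first, second
```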

Under control of the first set of bus select signals,
multiplexer circuit 206 provides a first multiplexed
reservation station input signal (INPUT 0) which is
provided as an input signal to the reservation stations.
The INPUT 0 signal includes a signal from one of the A
operand buses, a signal from one of the B operand buses,
a tag from one of the A tag buses, tag valid bits
corresponding to the A tag from the corresponding tag
valid bus, a tag from one of the B tag buses, tag valid
bits corresponding to the B tag from the corresponding
tag valid bus, a destination tag from one of the
destination tag buses, an opcode from one of the opcode
buses and a displacement from one of the displacement
buses. Under control of the second set of bus select
signals, multiplexer circuit 208 provides a second
multiplexed reservation station signal (INPUT 1) which is
provided as a second input signal to the reservation
stations. The INPUT 1 signal includes a signal from one
of the A operand buses, a signal from one of the B
operand buses, a tag from one of the A tag buses, tag
valid bits corresponding to the A tag from the
corresponding tag valid bus, a tag from one of the B tag
buses, tag valid bits corresponding to the B tag from the
corresponding tag valid bus, a destination tag from one
of the destination tag buses, an opcode from one of the
opcode buses and a displacement from one of the
displacement buses.

Reservation station entries 210 - 213 each receive the
two input signals, INPUT 0 and INPUT 1, in parallel as
well as respective load and shift bits. Reservation
station entries 210 - 213 also receive inputs from each
of the four result buses; these result bus inputs are
provided to the A and B operand portions of the entry
only. Information is retrieved from these result buses
based upon the A operand tag and the B operand tag. For
example, when an A operand tag provides a hit to
information that is on one of the destination tag buses,
then information from the corresponding result bus is
retrieved and loaded into the A operand field of the
reservation station entry.
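
As a rough illustration of this tag-based retrieval, the sketch below
models one operand field watching the result buses. The Operand class,
the snoop function and the clearing of the tag valid bit once data is
captured are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    value: int = 0
    tag: int = 0
    tag_valid: bool = False  # True while the operand value is still pending


def snoop(operand, result_buses):
    """Capture a result whose destination tag matches the pending operand tag.

    result_buses is a list of (destination_tag, result_value) pairs, one per
    result bus. Clearing tag_valid on capture is an assumption of this sketch.
    """
    if not operand.tag_valid:
        return operand
    for dest_tag, value in result_buses:
        if dest_tag == operand.tag:
            operand.value = value
            operand.tag_valid = False
            break
    return operand
```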

Additionally, reservation station entry RS0 receives a
reservation station entry from either reservation
station RS1 or RS2; reservation station entry RS0
provides a portion of the RS0 reservation station entry
(the A operand portion) to store buffer circuit 180 as
the RDATA A signal and provides the entire RS0
reservation station entry to R0 adder 216. R0 adder 216
uses this reservation station entry to generate the ADDR
A signal. Reservation station entry RS1 receives a
reservation station entry from reservation stations RS2
and RS3; reservation station entry RS1 provides a portion
of the RS1 reservation station entry (the A operand
portion) to store buffer circuit 180 as the RDATA B
signal and provides the entire RS1 reservation station
entry to R1 adder 218. R1 adder 218 uses this reservation
station entry to generate the ADDR B signal. Reservation
station RS2 receives a reservation station entry from
reservation station RS3; reservation station entry RS2
provides the RS2 reservation station entry to
reservation stations RS1 and RS0. Reservation station
RS3 provides the RS3 reservation station entry to
reservation stations RS2 and RS1.

By providing the parallel inputs and outputs from 
the reservation stations as well as the parallel forwarding
structure, reservation station circuit 124 may
execute either one or two load operations per cycle. 
More specifically, using the load and shift signals, 
controller 182 controls the loading and shifting of res- 
ervation station entries so that either one or two res- 
ervation station entries may be loaded or shifted in 
any given cycle. 

When one reservation station entry is being executed per
cycle, then reservation station RS0 provides a
reservation station entry to RS0 adder circuit 216 for
both load and store operations; additionally, for a store
operation RS0 provides the reservation station entry to
store buffer 180. Reservation station RS1 provides a
reservation station entry to reservation station RS0,
reservation station RS2 provides a reservation station
entry to reservation station RS1 and reservation station
RS3 provides a reservation station entry to reservation
station RS2. For a load operation, the data corresponding
to the address generated by RS0 adder circuit 216 is
provided to driver circuit 220.

When two reservation station entries are being executed
per cycle, then reservation stations RS0 and RS1 provide
respective reservation station entries to adder circuits
216, 218 for both load and store operations. Reservation
stations RS2 and RS3 provide reservation station entries
to reservation stations RS0 and RS1, respectively. For
load operations, the data corresponding to the addresses
generated by the RS0 and RS1 adder circuits are provided
as DATA A and DATA B from data cache 150. When two
reservation station entries are being executed per cycle,
and one operation is a load and the other operation is a
store, then the reservation station entry from which the
store operation is being performed is provided to store
buffer 180.
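
The one-entry and two-entry cases described above amount to shifting
the FIFO by one or by two positions. A minimal sketch, assuming the
four entries are modelled as a Python list ordered RS0 to RS3 with
None marking an empty entry (the function name advance is
hypothetical):

```python
def advance(entries, executed):
    """Issue `executed` entries from the head and shift the rest down.

    entries is [RS0, RS1, RS2, RS3]; None marks an empty entry. With one
    entry executed, RS1 moves to RS0, RS2 to RS1 and RS3 to RS2. With two
    executed, RS2 moves to RS0 and RS3 moves to RS1.
    """
    if executed not in (1, 2):
        raise ValueError("one or two entries execute per cycle")
    issued = entries[:executed]
    remaining = entries[executed:] + [None] * executed
    return issued, remaining
```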

If a load produces a cache miss, then a load miss 
algorithm must be executed. Since this load operation 
is speculative, the miss operation is not initiated until 
the load is the next ROP to retire. Because of this, the 
load holds in the RS0 reservation station and waits
for the release signal from the reorder buffer. A status
indication along with the destination tag is driven back 
to the reorder buffer to indicate this condition. 

Referring to Fig. 4, each reservation station entry of
reservation station circuit 124 includes a reservation
station entry valid bit (V), a 40-bit A operand field, a
32-bit B operand field, a 32-bit displacement field, a
4-bit destination tag (DTAG) field, an 8-bit opcode field
and an 8-bit additional opcode information (INLS) field.
Additionally, each reservation station entry also
includes a 4-bit A operand upper byte tag (ATAGU), a
4-bit A operand middle byte tag (ATAGM) and a 4-bit A
operand lower byte tag (ATAGL), a 4-bit B operand upper
byte tag (BTAGU), a 4-bit B operand middle byte tag
(BTAGM) and a 4-bit B operand lower byte tag (BTAGL)
along with the corresponding A and B operand tag valid
bits. Each reservation station entry also includes a
corresponding cancel bit (C).

The A operand upper, middle and lower byte tags
are tags for upper, middle and lower portions of an in- 
teger operand. The integer operand is divided this 
way because, under the x86 architecture, it is possi- 
ble to reference either the upper or lower byte of the 
lower half-word, the lower half-word or the entire 32- 
bit double word of an x86 integer. Accordingly, the M 
and L refer to the upper and lower byte of the lower 
half word and the U refers to the upper half-word for 
a B operand and to the remaining upper bits for an A 
operand (because the remaining portion of the A op- 
erand may be either 16-bits or 24-bits). When refer- 
encing the lower half word, the L and M tags are set 
to the same value. All three tags are set to the same 
value when referencing a 32-bit value which is pend- 
ing in the reservation station entry. 

The cancel bit indicates that a particular opcode should
be cancelled; this bit is set when any opcode is within a
mispredicted branch. The opcode is cancelled in order to
prevent cancelled stores that hit in data cache 150 from
entering store buffer circuit 180, because stores that
are executed update the state of an entry which is stored
in data cache 150. Loads that are cancelled simply return
their results when there is a hit in data cache 150 and
thus are not problematic, as a load does not update any
state.

The reservation station entry valid bit of the
reservation station entry is coupled to the dispatch
valid bit portion of the INPUT 0 and INPUT 1 input
signals. Each input signal valid bit, which is coupled to
the dispatch bus, is set when the dispatch valid bit is
set. The A operand field of the reservation station entry
is coupled to the A operand portion of the INPUT 0 and
INPUT 1 input signals. The B operand field of the
reservation station entry is coupled to the B operand
portion of the INPUT 0 and INPUT 1 input signals. The
displacement field of the reservation station entry is
coupled to the displacement portion of the INPUT 0 and
INPUT 1 input signals. The destination tag field of the
reservation station entry is coupled to the destination
tag portion of the INPUT 0 and INPUT 1 input signals. The
opcode field of the reservation station entry is coupled
to the opcode portion of the INPUT 0 and INPUT 1 input
signals. The additional opcode information (INLS) field
of the reservation station entry is coupled to the INLS
portion of the INPUT 0 and INPUT 1 input signals via the
INLS buses.

The A operand upper byte tag, middle byte tag and lower
byte tag of the reservation station entry are coupled to
the A tag portion of the INPUT 0 and INPUT 1 input
signals. The B operand upper byte tag, middle byte tag
and lower byte tag are coupled to the B tag portion of
the INPUT 0 and INPUT 1 input signals. The A and B
operand tag valid bits of the reservation station entry
are coupled to the tag valid portion of the INPUT 0 and
INPUT 1 input signals. The cancel bit of the reservation
station entry is coupled to load store controller 182,
and is set based upon control information which is
received from reorder buffer 114 and branch section 135.

The type match signals which are generated by 
load store controller 182 determine whether any in- 
structions have been dispatched to the load store 
functional unit. More specifically, when load store 
controller 182 determines that the load store functional
unit type code matches a type code that is provided
on one of the four TAD buses, then load store controller
182 selects that particular dispatch position for the
INPUT 0 signal. When load store controller 182 determines
that the load store functional unit type code
matches a type code that is provided by another one 
of the four TAD buses, then load store controller 182 
selects that particular dispatch position for the INPUT 
1 signal. 

Referring to Fig. 5, RS0 adder circuit 216 receives
address components from reservation station 210
and provides the linear address signal ADDR A as well
as a valid segment access signal. RS0 adder circuit
216 includes logical address adder 240 and linear ad- 
dress adder 242. Logical address adder 240 provides 
a logical address to linear address adder 242. Logical 
address adder 240 receives an A operand adder sig- 
nal from A operand multiplexer 244, a B operand ad- 
der signal from B operand multiplexer 246 and a dis- 
placement adder signal from displacement multiplex- 
er 248. 

A operand multiplexer circuit 244 receives the A operand
from reservation station entry 210, as well as a quantity
zero; the value which is multiplexed and provided as the
A operand adder signal is determined by the address mode
control information which is received from load store
controller 182. B operand multiplexer circuit 246
receives a scaled B operand from shift circuit 247. The B
operand is scaled based upon the scale signal which is
received from instruction decoder 108 via the INLS buses.
B operand multiplexer circuit 246 also receives a start
address which is stored in start address register 249
under control of load store controller 182 and the
misaligned access 1 address which is stored in misaligned
access register 251 from a prior misaligned access; the
value which is multiplexed and provided as the B operand
adder signal is determined by the address mode control
information. Displacement multiplexer circuit 248
receives the displacement address component from
reservation station entry 210. Displacement multiplexer
circuit 248 also receives the quantities four, five,
minus four and minus two; the value which is multiplexed
and provided as the displacement adder signal is
determined by the address mode control information.

For an aligned access load operation, the A operand is
selected by multiplexer 244, the B operand is selected by
multiplexer 246 and the displacement is selected by
multiplexer 248. For a misaligned access load operation,
i.e., any access which crosses a doubleword boundary, the
first misaligned access address is generated as a normal
load operation and adder 240 generates a misaligned
access 1 address. Misaligned access 1 register 251 holds
this misaligned access 1 address. In the next clock
cycle, the value 0 is selected by A operand multiplexer
244, the value 4 is selected by B operand multiplexer
246, and the misaligned access 1 address is selected by
multiplexer 248, thus causing adder 240 to add the
quantity 4 to the misaligned access 1 address. For a
multiple ROP operation, e.g., a 64-bit load operation,
the first access address is generated as a normal load
operation and adder 240 generates a multiple ROP start
address. Start address register 249 holds this start
address. When the second ROP is accessed, the second ROP
address is formed by adding the start address from
multiplexer 248 and the value 4 from multiplexer 246. For
an 80-bit multiple ROP operation, the value 5 is provided
by multiplexer 246. Each multiple ROP operation can be
misaligned; in this instance, the start address is
functionally the same as the misaligned access 1 address.
For a PUSH operation, a value is subtracted from the B
operand address depending on the access size of the
operation. If the access size is a doubleword, then the
value 4 is subtracted; if the access size is a word, then
the value 2 is subtracted. The scaling factor, which
controls shift circuit 247, is generated by load store
controller 182 based upon the INLS information.
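
The address adjustments described above can be summarized in a short
sketch. It covers only the misaligned second access and the PUSH
case; the function names are hypothetical and the 32-bit wraparound
is an assumption of the sketch rather than something the text states.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit address wraparound


def crosses_doubleword(addr, size):
    """True when `size` bytes starting at `addr` span a 4-byte boundary."""
    return (addr % 4) + size > 4


def second_access_address(misaligned_access_1_addr):
    """Misaligned access 2: the held misaligned access 1 address plus 4."""
    return (misaligned_access_1_addr + 4) & MASK32


def push_address(b_operand_addr, size):
    """PUSH: subtract 4 for a doubleword access, 2 for a word access."""
    if size not in (2, 4):
        raise ValueError("PUSH access size must be a word or a doubleword")
    return (b_operand_addr - size) & MASK32
```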

Adder circuit 216 also includes segment descriptor array
250 and limit check circuit 252. Segment descriptor array
250 provides a segment limit signal to limit check
circuit 252 and a segment base address signal to adder
circuit 242. Limit check circuit 252 also receives the
logical address from logical adder 240 and provides a
valid segment access signal indicating that the logical
address is within the segment limits as set forth by the
limit provided by segment descriptor array 250.

Adder circuit 240 receives the A operand adder signal,
the B operand adder signal and the displacement adder
signal and adds these signals to provide a logical
address signal. Adder circuit 242 adds the segment base
address, which is received from segment descriptor array
250, to the logical address to provide the linear
address.
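
The two-stage addition and the limit check can be sketched as below.
This is a simplified illustration: the function name, the parameter
names, the 32-bit masking and the plain comparison of the logical
address against the limit (ignoring access size and any special
segment types) are all assumptions.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit arithmetic


def generate_address(a_operand, b_operand, displacement,
                     scale_shift, seg_base, seg_limit):
    """Return (linear_address, valid_segment_access).

    Stage 1 (adder 240): logical = A + scaled B + displacement.
    Stage 2 (adder 242): linear = segment base + logical.
    The limit check here is a simplified comparison (an assumption).
    """
    logical = (a_operand + (b_operand << scale_shift) + displacement) & MASK32
    valid = logical <= seg_limit
    linear = (seg_base + logical) & MASK32
    return linear, valid
```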

RS1 adder 218 is similar to RS0 adder with the exception
that RS1 adder 218 does not include multiplexer 248
because unaligned accesses are only performed using the
RS0 reservation station. In RS1 adder 218, the
displacement is provided directly to adder 240 as the
displacement adder signal. Additionally, because
unaligned accesses are not performed, multiplexer 246 for
RS1 adder 218 is not provided with the values 4 and 5.

Referring to Fig. 6, store buffer 180 includes four store
buffer entries SB0 300, SB1 301, SB2 302 and SB3 303 as
well as A port merge circuit 306 and B port merge circuit
308. A port merge circuit 306 receives the A port data
signal from data cache 150 and the A port data signal
from reservation station entry RS0 of reservation station
circuit 124 and merges these signals to provide a merged
A data signal to store buffer entries SB0 - SB3. B port
merge circuit 308 receives the B port data signal from
data cache 150 and the B port data signal from
reservation station entry RS1 of reservation station
circuit 124 and merges these signals to provide a merged
B data signal to store buffer entries SB0 - SB3. By
providing merge circuits 306, 308, a steering function is
provided.

For example, one byte of the four byte DATA A signal may
have been updated when provided by reservation station
circuit 124. This updated byte is then merged with the
three remaining bytes from the DATA A signal which is
provided by data cache 150. Merge circuits 306, 308 are
controlled by load store controller 182 based upon the
access size, the least significant two bits of the linear
address and whether an access is a misaligned access 1 or
a misaligned access 2. The steering function which is
provided by merge circuits 306, 308 is possible because
stores are done as read modify write operations. By
providing this steering function, data cache 150 does not
require complex steering circuitry because all accesses
to data cache 150 are 32-bit double word accesses.
Additionally, whatever information is in a store buffer
entry is a reflection of what will be stored in data
cache 150, thus allowing load store functional unit 134
to provide a load forwarding operation. In a load
forwarding operation, loads may be performed before
stores are actually stored in data cache 150 by accessing
the store buffer entries; load forwarding removes store
operations from the microprocessor's critical timing path.
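
The read modify write merge can be illustrated with a minimal sketch
that places the store bytes into the doubleword read from the cache.
The function name merge_store and the little-endian byte numbering
are assumptions; the misaligned access 1 and 2 cases are not
modelled.

```python
def merge_store(cache_dword, store_data, addr, size):
    """Merge the low `size` bytes of store_data into cache_dword.

    The byte position is taken from the least significant two bits of the
    linear address. Little-endian byte numbering is assumed, and the access
    must fit within one doubleword (no misaligned handling in this sketch).
    """
    offset = addr & 3
    if size not in (1, 2, 4) or offset + size > 4:
        raise ValueError("access must fit within one doubleword")
    mask = ((1 << (8 * size)) - 1) << (8 * offset)
    kept = cache_dword & ~mask & 0xFFFFFFFF
    return kept | ((store_data << (8 * offset)) & mask)
```

With a one-byte store at byte offset 1, three bytes come from the
cache and one from the reservation station, which is the case the
paragraph above describes.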

Each store buffer entry also receives input signals from
the four result buses, the ADDR A and ADDR B address
signals from reservation station 124 and the TAG A and
TAG B tag signals from reservation station 124 as well as
control signals from load store controller 182. These
control signals include the load signals and the shift
signals. Additionally, store buffer entry SB0 receives an
output from store buffer entry SB1 and provides a store
output to IAD bus 102. Store buffer entry SB1 receives a
store buffer entry output from store buffer entry SB2 and
also receives a store buffer entry output from store
buffer entry SB0 and provides a store buffer entry output
to SB0. Store buffer entry SB2 receives a store buffer
entry output from store buffer entry SB3 and also
receives an entry from store buffer entries SB0 and SB1,
and provides a store buffer entry output to SB1. Store
buffer entry SB3 receives a store buffer entry output
from store buffer entries SB0, SB1 and SB2 and provides a
store buffer entry output to SB2.

By providing store buffer entries SB1 - SB3 with the feed
back from less significant store buffer entries, a store
forwarding operation is possible. For example, the store
buffer entry SB0 is provided to more significant store
buffer entries SB1 - SB3 to allow these store buffer
entries to combine the SB0 store buffer entry with the
more significant entries when the entries have the same
linear address. Accordingly, when the store buffer entry
is stored, it includes all modifications to the entry.
The store forwarding function is discussed in more detail
below.

Store forwarding allows the system to operate without
having to stall a reservation station until a previous
store is stored in data cache 150. Because, in the x86
architecture, a significant number of consecutive byte
accesses occur, store forwarding significantly increases
the speed with which loads are performed by removing the
dependency of a load operation on a store operation.

Referring to Fig. 7, each store buffer entry SB0 - SB3 of
store buffer circuit 180 includes the information set
forth in store buffer entry 339. Store buffer entry 339
includes 32-bit data double word 340, tag portion 341,
32-bit linear address 342 and control information portion
344. Data double word 340 includes four data bytes, data
byte 0 - data byte 3.

Tag portion 341 includes four byte tag portions which
correspond to data bytes 0 - 3. Byte 0 tag portion
includes a byte 0 tag (TAG BYTE 0), a byte 0 control bit
(B0) and a byte 0 tag valid bit (TV). Byte 1 tag portion
includes a byte 1 tag (TAG BYTE 1), a byte 1 control bit
(B1) and a byte 1 tag valid bit (TV). Byte 2 tag portion
includes a byte 2 tag (TAG BYTE 2), byte 2 control bits
(B0, B1) and a byte 2 tag valid bit (TV). Byte 3 tag
portion includes a byte 3 tag (TAG BYTE 3), byte 3
control bits (B0, B1) and a byte 3 tag valid bit (TV).

The byte tags TAG BYTE 0 - 3 provide tags to retrieve
data bytes 0 - 3 from result buses. The byte control bits
indicate from which result bus byte a data byte should be
retrieved. More specifically, when byte 0 control bit B0
is set, it indicates that data should be forwarded from a
result bus byte 1; if byte 0 control bit B0 is cleared
then data should be forwarded from a result bus byte 0.
When byte 1 control bit B1 is set, it indicates that data
should be forwarded from a result bus byte 0; if byte 1
control bit B1 is cleared, then data should be forwarded
from a result bus byte 1. When byte 2 control bit B1 is
set, it indicates that data should be forwarded from a
result bus byte 1 and when byte 2 control bit B0 is set,
it indicates that data should be forwarded from a result
bus byte 0; if byte 2 control bits B0 and B1 are cleared
then data should be forwarded from a result bus byte 2.
When byte 3 control bit B1 is set, it indicates that data
should be forwarded from a result bus byte 1 and when
byte 3 control bit B0 is set, it indicates that data
should be forwarded from a result bus byte 0; if byte 3
control bits B0 and B1 are cleared then data should be
forwarded from a result bus byte 3. The byte tag valid
bits TV indicate that the corresponding tag field
contains a valid byte tag.
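
The control bit encoding above can be restated as a small decoding
function. This is a sketch only: the function name is hypothetical,
and the behaviour when both B0 and B1 are set for bytes 2 and 3
(B1 is checked first here) is an assumption, since the text does not
describe that combination.

```python
def result_byte_source(data_byte, b0=False, b1=False):
    """Return which result bus byte (0-3) feeds store buffer data byte 0-3.

    Byte 0 uses only B0 and byte 1 uses only B1, as in the text. For bytes
    2 and 3, checking B1 before B0 is an assumption of this sketch.
    """
    if data_byte == 0:
        return 1 if b0 else 0
    if data_byte == 1:
        return 0 if b1 else 1
    if data_byte in (2, 3):
        if b1:
            return 1
        if b0:
            return 0
        return data_byte
    raise ValueError("data_byte must be 0, 1, 2 or 3")
```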

The store buffer tags indicate the actual byte locations
in memory whereas with the reservation station tags,
there is not a one to one correspondence between tag and
location. With the reservation station tags, the L and M
tags can map to any locations within the store buffer
tags. Unaligned accesses with valid tags are not allowed
into the store buffer. For unaligned access stores,
reservation station entries RS0 and RS1 wait until valid
data is received by the reservation station and then the
data is provided as two store buffer entries to the store
buffer.

Control portion 344 includes a store buffer entry valid
bit (V), a 2-bit unaligned access control signal (UA), a
write protect bit (WP), a non-cacheable store bit (NC),
an Input/Output access bit (IO), a floating point update
pointer bit (FP), a physical access bit (P), a locked
access bit (L), and a 2-bit column indication signal
(C1). The store buffer entry valid bit indicates that the
particular entry is valid, i.e., that there is some valid
information stored in this store buffer entry. The
unaligned access control signal indicates which portion,
i.e., the first portion or the second portion, of an
unaligned access is stored in the entry. The
non-cacheable store bit indicates that the store entry is
not cacheable and accordingly that the entry should not
be written into data cache 150. The I/O access bit
indicates to the external interface that an I/O access is
occurring. The physical access bit indicates that the
memory management unit should bypass the linear to
physical translation because the store address is a
physical address; this occurs when the load store
functional unit is updating either the page directory or
the TLB of memory management unit 164. The locked access
bit indicates to unlock the external bus which may have
been locked by a previous load. The column indication
signal indicates one of four columns in the data cache
that is being written; accordingly, there is no need to
perform a column lookup in data cache 150 when performing
a store operation.

Referring to Fig. 8, store buffer entry circuit SB2 302
is shown as an example of each store buffer entry
circuit. Store buffer entry circuit 302 includes store
buffer entry register 360 as well as store buffer entry
byte data multiplexers 362, 363, 364, 365, which
correspond to data bytes 0 - 3 of store buffer entry 339,
store buffer entry tag multiplexer 370, which corresponds
to the tags of store buffer entry 339, and store buffer
entry address multiplexer 372, which corresponds to the
address of store buffer entry 339. Store buffer entry
circuit 302 also includes tag compare circuit 374 and
address compare circuit 376. Store buffer entry register
360 includes store buffer data entry register 380, store
buffer address entry register 382, store buffer tag entry
register 384 and store buffer control entry register 386.

Store buffer entry register circuit 360 is a register
which receives a store buffer entry 339 in parallel from
store buffer entry data byte multiplexers 362 - 365, tag
multiplexer 370 and address multiplexer 372 and provides
a store buffer entry 339 in parallel to store buffer
entry circuits SB1 and SB3. Additionally, store buffer
data entry register 380 provides data bytes 0 - 3 to data
port A and data port B of reservation station mixer
circuit 220. These data bytes are provided to allow load
store functional unit 134 to perform a load forwarding
operation.

Byte multiplexer circuits 362 - 365 receive respective
bytes from A merge circuit 306, B merge circuit 308, and
the four result buses as well as from store buffer entry
circuits SB3, SB0 and SB1. Byte multiplexer circuits 362
- 365 are controlled by store buffer control signals
which are provided by load store controller 182 based
upon the linear addresses for each store buffer entry and
matches of the linear addresses from entries in the
reservation station. The result buses are controlled by
store buffer control signals which are provided by load
store controller 182 based upon whether a tag valid bit
is present for a particular byte. If the tag valid bit is
set for a particular byte, then that particular byte
monitors the result buses and multiplexes whichever
result bus has a value which matches the tag.

For example, byte multiplexer circuit 362 receives the
byte 0 data from each of the A merge signal, B merge
signal, four result signals, and store buffer entries
SB3, SB0 and SB1. Based upon the store buffer control
signals, byte multiplexer circuit 362 provides one of
these data bytes as byte 0 of the SB2 store buffer entry
which is held in store buffer register circuit 360.

Each byte which is stored in store buffer data register
380 is a direct reflection of what is stored in memory;
accordingly, byte steering is provided to arrange the
data bytes to correspond to what is stored in memory.
Byte steering is provided by providing byte multiplexer 0
362 and byte multiplexer 1 363 with inputs from the four
result bus byte 0's and the four result bus byte 1's in
parallel, providing byte multiplexer 2 364 with inputs
from the four result bus byte 0's, the four result bus
byte 1's and the four result bus byte 2's in parallel,
and providing byte multiplexer 3 365 with inputs from the
four result bus byte 0's, the four result bus byte 1's
and the four result bus byte 3's in parallel.
Multiplexers 2 and 3 (364, 365) receive result bus bytes
0 and 1 because the L and M bytes of the result signal
may correspond to any one of the byte positions in the
store buffer, whereas result byte 2 can only correspond
to data byte 2 and result byte 3 can only correspond to
data byte 3.

Address multiplexer 372 receives the ADDR A signal and
the ADDR B signal from reservation station 124 and
provides one of these addresses to store buffer address
register 382 as linear address 342. Store buffer address
register 382 provides the address portion 342 of store
buffer entry 339 to address compare circuit 376 which
also receives the ADDR A and ADDR B signals from
reservation station 124. Address compare circuit 376
compares the ADDR A and ADDR B signals to linear address
342 every clock cycle. If there is a match between ADDR A
or ADDR B and linear address 342, then load store
controller 182 causes reservation station 124 to read the
data from store buffer data register 380 via the port
which corresponds to the address compare match, rather
than via the corresponding port of the data cache 150.
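
A minimal sketch of this address compare and redirect follows. The
list and dictionary used to model the store buffer and the cache, the
function name load and the choice to search the newest store buffer
entry first are assumptions made for illustration.

```python
def load(addr, store_buffer, cache):
    """Return (data, source) for a load at linear address `addr`.

    store_buffer is a list of (linear_address, data) pairs with the newest
    entry last; cache maps linear addresses to data. A matching store
    buffer entry supplies the data instead of the cache. Searching the
    newest entry first is an assumption of this sketch.
    """
    for entry_addr, data in reversed(store_buffer):
        if entry_addr == addr:
            return data, "store buffer"
    return cache[addr], "data cache"
```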

Tag multiplexer 370 receives the tags from store buffer
entries SB0, SB1 and SB3. Tag multiplexer 370 also
receives tags from the A and B tags of the reservation
station entry. Tag bytes are held in tag register 384 and
are subject to forwarding, but tag register 384 does not
receive tag inputs from the result buses. The tags from
the result bus are monitored by tag control circuit 374.
If a tag which is held by tag register 384 matches a tag
from one of the result buses, then tag control circuit
374 controls byte multiplexers 362 - 365 to cause the
result bus which provides the tag match to provide data
to the corresponding store buffer data register.
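
The per-byte tag match can be sketched as follows for the simple case
in which no byte control bits are set, so each data byte takes the
same-numbered result bus byte. The function name capture, the list
representation and the clearing of the tag valid bit on capture are
assumptions.

```python
def capture(entry_bytes, byte_tags, tag_valid, result_tag, result_value):
    """Fill pending store buffer data bytes from one result bus.

    entry_bytes, byte_tags and tag_valid are 4-element lists for data bytes
    0-3. A byte with a valid tag equal to result_tag takes the same-numbered
    byte of the 32-bit result_value (control bits assumed clear). Clearing
    tag_valid on capture is an assumption of this sketch.
    """
    for i in range(4):
        if tag_valid[i] and byte_tags[i] == result_tag:
            entry_bytes[i] = (result_value >> (8 * i)) & 0xFF
            tag_valid[i] = False
    return entry_bytes, tag_valid
```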

Control portion 344 of store buffer entry 339 is 
provided by load store controller 182 to store buffer 
control register 386. 

Store buffer entry circuits SB0, SB1 and SB3 differ
only in the input signals which are provided by the
other store buffer entries. More specifically, store
buffer entry SB0 receives only the output from store
buffer entry SB1. Store buffer entry SB1 receives the
output entries from store buffer entries SB0 and SB2. Store
buffer entry SB3 receives the output entries from
store buffer entries SB0, SB1 and SB2.

Referring to Figs. 6-8, store buffer 180 provides 
temporary storage for pending store operations. By 
using the store byte tags, these pending store opera- 
tions need not necessarily have complete store data. 
Additionally, by using the store byte tags in conjunc- 
tion with the store buffer entry feed back, store buffer 
180 performs store forwarding operations. Addition- 
ally, because load operations may be dependent 
upon store operations which have not yet been stored 
in data cache 150, store buffer 180 can perform load 
forwarding operations. 

For example, for a doubleword store of a register
with a pending 32-bit update, the byte tags 0-3 are
valid in the reservation station entry, as indicated by
the respective tag valid bits. An update is pending
when a functional unit is going to produce, but has not
yet produced, a value for a store operation. If the cache
access provides a cache hit, the store operation transfers
to store buffer circuit 180 from reservation station
entry RS0. The A operand upper byte tag ATAGU of
the reservation station entry is duplicated as the byte 3
and byte 2 tags in the store buffer entry. The ATAGL
and ATAGM reservation station byte tags are provided
as the store buffer byte 0 and byte 1 tags, respectively.
(In the case of a doubleword write, all of these
tags are actually the same.) None of the byte control
bits B0 and B1 are set. When the result is made available
by the functional unit, store buffer 180 compares
each byte tag against the tags appearing on the result
buses, using tag compare circuit 374, and, using
multiplexers 362 - 365, gates in data from the respective
byte of the result bus whenever there is a match on a
tag. In the case of a doubleword store, each byte
matches simultaneously.
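
The byte tag matching described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `ByteSlot`, `ResultBus` and `snoop` are hypothetical, and the B0/B1 byte control bits are simplified into a single `src_byte` field that names the result bus byte to gate in. The actual circuit performs all comparisons in parallel in hardware.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ByteSlot:
    """One byte position of a store buffer entry (hypothetical model)."""
    tag: Optional[int]              # tag of the pending result; None once data is valid
    data: Optional[int] = None      # byte value, once known
    src_byte: Optional[int] = None  # result bus byte to take; None means "same position"


@dataclass
class ResultBus:
    """One result bus carrying a tag and four data bytes (byte 0 first)."""
    tag: int
    data: bytes


def snoop(entry: List[ByteSlot], buses: List[ResultBus]) -> None:
    """Compare every pending byte tag with every result bus tag and gate in data."""
    for position, slot in enumerate(entry):
        if slot.tag is None:
            continue  # this byte already holds valid data
        for bus in buses:
            if bus.tag == slot.tag:
                source = position if slot.src_byte is None else slot.src_byte
                slot.data = bus.data[source]
                slot.tag = None  # the byte is no longer pending
                break


# Doubleword store: all four byte tags are the same, so all bytes match at once.
entry = [ByteSlot(tag=7) for _ in range(4)]
snoop(entry, [ResultBus(tag=3, data=bytes(4)),
              ResultBus(tag=7, data=bytes([0x11, 0x22, 0x33, 0x44]))])
```

In this model a word store into bytes 2 and 3 would set `src_byte` to 0 and 1 on those two slots, which stands in for the steering role the text assigns to the byte control bits.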

For a doubleword store of a register with a pending
doubleword and a subsequent pending byte update
to a byte of the same doubleword, at least two
tags appear in the final doubleword; the same tag is
used for bytes 0, 2 and 3 and a different tag is used
for byte 1. This different tag represents that the second
byte store has occurred. More specifically, the
first doubleword is stored in store buffer entry SB0
with four valid tags and the byte store is stored in the
more significant store buffer entry SB1 with a new tag
placed in byte 1 while the tags from bytes 0, 2 and 3
are forwarded from SB0. Accordingly, store forwarding
is achieved which provides the byte 1 result onto
the result bus and writes the byte 1 result into the
doubleword storage buffer entry without steering by
using the byte control bits.

For a word store of a word register with a pending
update to bytes 2 and 3 in memory, the tags for bytes
0 and 1 are placed into bytes 2 and 3 with the B1 bit
set in byte 3 and the B0 bit set in byte 2. When this
tag is driven onto the result bus, these bytes
simultaneously forward from bytes 0 and 1, respectively,
of the result bus which corresponds to this tag into store
buffer data register 380. This example also applies
to the case where there are two pending byte updates
to a word which is stored in a store buffer entry. The
two bytes in the store buffer entry forward from
different result buses, potentially at different times.

For a byte store, a single byte is replaced with a
tag having the B1 or B0 bit set depending upon whether
the source byte is a high byte or a low byte. When
this tag matches, it gates data from the indicated byte
of the result bus. This works even for a byte store of
a register that has a pending word or doubleword update.
In this case, the byte is expected on the corresponding
location of the result bus, even though the
entire bus may contain valid data.

Sometimes when performing a store operation,
the read phase of the store may receive data forwarded
from a less significant store buffer entry rather
than from data cache 150. As a result, store buffer 180
inserts a tag into a data word that already has a tag
in it. This can happen, for example, when more than
one byte is written into the same doubleword within a
short span of time. Accordingly, the information that
is stored in the store buffer entry can have more than
one tag, each representing a different result. In operation,
each tag compares against the result buses
and gates in the proper byte at the proper time. Because
unaligned stores are not allowed to write tags
into store buffer 180, awkward forwarding cases do
not occur.

When performing a load operation, address compare
circuit 376 of the store buffer 180 compares the
linear addresses which are provided by the RS0 and RS1
adders to the linear addresses of the store buffer
entries. If there is a match between the load address and
the address stored in one of the store buffer entries,
as indicated by the hit signal that address compare
circuit 376 provides, then load store controller 182
determines the load is dependent on the store. If the
load is dependent on the store, the data from the store
buffer entry which provided the linear address match
is provided via whichever port supplied the matching
address. This operation is referred to as a load
forwarding operation.
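
A load forwarding operation of this kind can be sketched in software as an address comparison against the store buffer before falling back to the data cache. The names below (`StoreBufferEntry`, `load`) are hypothetical, the cache is modelled as a plain dictionary, and searching the newest entry first is an assumption of this sketch rather than something the description states.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StoreBufferEntry:
    """A pending store: linear address plus data (hypothetical model)."""
    linear_address: int
    data: int


def load(address: int,
         store_buffer: List[StoreBufferEntry],
         cache: Dict[int, int]) -> Tuple[int, str]:
    """Return (value, source) for a load on one port.

    The load address is compared with the address of every store buffer
    entry. On a match the data is forwarded from the store buffer; otherwise
    it is read from the data cache model. Assumption: the most recently
    added matching entry (end of the list) takes priority.
    """
    for entry in reversed(store_buffer):
        if entry.linear_address == address:
            return entry.data, "store_buffer"
    return cache[address], "data_cache"


store_buffer = [StoreBufferEntry(0x100, 1),
                StoreBufferEntry(0x200, 2),
                StoreBufferEntry(0x100, 3)]
cache = {0x100: 10, 0x300: 30}
forwarded = load(0x100, store_buffer, cache)  # taken from the store buffer
from_cache = load(0x300, store_buffer, cache)  # no match, read from the cache
```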

Referring to Fig. 9, data cache 150 is a linearly
addressed cache. Co-filed application
(Attorney Reference PCS/TT0272/SMP)
based on US application 08/146,381, which is incorporated
by reference, sets forth the structure and operation
of the linear addressing aspects of data cache
150 in greater detail.

An entry 400 of data cache 150 is shown. For 
each entry of data cache 150, the middle order bits 
of each linear address corresponding to the cache en- 
try provide a cache index which is used to address the 
linear tag arrays and retrieve an entry from each lin- 
ear tag array. The upper order bits of each linear ad- 
dress are compared to the linear data tags stored 
within the retrieved entries from address tag array 
310. The lowest order bits of each linear address pro- 
vide an offset into the retrieved entry to find the ac- 
tual byte addressed by the linear address. Because 
data cache 150 is always accessed in 32-bit words, 
these lowest order bits are not used when accessing 
data cache 150. 

Data cache entry 400 of data cache 150 includes
linear address tag entry 402 and data entry 404. Data
entry 404 includes a sixteen byte (DBYTE0 - DBYTE15)
block of data. Data linear address tag entry 402
includes a data linear tag value (DTAG), linear tag valid
bit (TV), and valid physical translation bit (P). The
data linear tag value, which corresponds to the upper
21 bits of the linear address, indicates the linear block
frame address of a block which is stored in the
corresponding store array entry. The linear tag valid bit
indicates whether or not the linear tag is valid. The valid
physical translation bit indicates whether or not an entry
provides a successful physical tag hit as discussed below.
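
The address fields described above can be illustrated with a short sketch that splits a linear address into tag, line index, bank select and byte offset. The 21-bit tag and the sixteen byte block come from this description, and the use of bits 2 and 3 as bank select bits is stated elsewhere in this document. The 32-bit address width and the 7-bit line index (a 2-Kbyte column divided into sixteen byte lines) are assumptions made for the sketch, and the function name is hypothetical.

```python
def split_linear_address(address: int) -> dict:
    """Split a linear address into the fields used to access the cache model.

    Assumed layout (most significant first):
      tag  : bits 31..11 (21 bits, compared with the stored linear tag)
      line : bits 10..4  (7 bits, selects one of 128 lines in a column)
      bank : bits 3..2   (2 bits, selects one 32-bit bank within the line)
      byte : bits 1..0   (2 bits, byte within the 32-bit word)
    """
    address &= 0xFFFFFFFF  # assume 32-bit linear addresses
    return {
        "tag": address >> 11,
        "line": (address >> 4) & 0x7F,
        "bank": (address >> 2) & 0x3,
        "byte": address & 0x3,
    }


fields = split_linear_address(0x12345678)
# The fields can be reassembled into the original address.
rebuilt = (fields["tag"] << 11) | (fields["line"] << 4) | (fields["bank"] << 2) | fields["byte"]
```

Under these assumptions the line, bank and byte fields together occupy the low 11 bits of the address.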

Referring to Fig. 10, data cache linear tag circuit
202 and data cache store array 200 of linearly
addressable data cache 150 are shown. Data cache 150
is arranged in four 2-Kbyte columns, column 0, column 1,
column 2, and column 3. Data linear tag circuit
202 simultaneously receives the two linear addresses
ADDR A, ADDR B and data store array 312 simultaneously
provides the two data signals DATA A, DATA
B, i.e., data cache 150 functions as a dually accessed
data cache.

Data store array 200 includes four separate data
store arrays, column 0 store array 430, column 1 store
array 431, column 2 store array 432, and column 3
store array 433 as well as multiplexer (MUX) circuit
440. Multiplexer 440 receives control signals from
data linear tag circuit 202 which indicate whether
there is a match to a linear tag value stored in a
respective linear tag array. Multiplexer 440 receives the
data from store arrays 430 - 433 and provides this
data to load store functional unit 134.

Linear tag circuit 202 includes linear tag arrays
450 - 453 corresponding to columns 0-3. Each linear
tag array is coupled with a corresponding compare
circuit 454 - 457. Accordingly, each column of data
cache 150 includes a store array, a linear tag array
and a compare circuit. Store arrays 430 - 433, address
tag arrays 450 - 453, and compare circuits 454 - 457
all receive the linear addresses, ADDR A and ADDR
B, from load store section 134.

IAD bus 102 is coupled to each store array 430 -
433 via store address multiplexer 460 to provide
a store address. IAD bus 102 is also coupled to store
register 460 which is coupled to each store array 430
- 433. The store address, which is provided by IAD
bus 102, is provided to index a particular column and
to select a particular bank; the particular column is
selected by column select bits, which are provided either
by store buffer 180 when performing a store or by
physical tag circuit 162 when performing a reload. For
a store, only one bank is accessed. The bank select
bits, bits 2 and 3 of the address which is provided by
IAD bus 102, are used to access the bank. For a reload,
all four banks are accessed in parallel.

IAD bus 102 is used during both store operations
and reload operations to write data to store arrays 430
- 433 of data cache 150. When performing a store operation,
data is written in store arrays 430 - 433, via
store register 460, in 32-bit doublewords. For a store
buffer write, the IAD bus address is provided
to the ADDR B input of data cache 150; ADDR B and the
IAD address are multiplexed by address multiplexer
461.

When performing a reload operation, data is written
into store arrays 430 - 433 in 128-bit lines. Store
register 460 collects 128 bits of data from IAD bus
102 in two 64-bit accesses; after the 128 bits are collected,
store register 460 writes this data to store arrays
430 - 433. For a reload, store register 460 multiplexes
the address lines of IAD bus 102 to receive
data, because 64 bits are written in each phase. Address
multiplexer 461 multiplexes the IAD address
onto the ADDR B address path to index into the rows.
Data cache store multiplexer 460 is controlled by data
cache controller 190 based upon whether a store or
a load operation is being performed. For a reload operation,
load store controller 134 writes the reload address
via port A of data cache 150; accordingly, data
cache 150 uses ADDR A for a reload address.

Referring to Figs. 11 and 12, each store array of
data cache 150 is banked to allow multiple accesses
in a single clock cycle without requiring the overhead
associated with dual porting. More specifically, each
store array includes four banks 470 - 473 which each
store a 32-bit double word of data; each bank includes
a respective bank address multiplexer 474 - 477. The
combination of the four banks provides access to a
single line of data cache 150.

Each bank 470 - 473 is individually addressed by
either ADDR A or ADDR B, which address is provided
by a respective bank address multiplexer 474 - 477.
Bank address multiplexers 474 - 477 are controlled by
the bank select bits of ADDR A and ADDR B. Because
each bank is individually addressed, more than
one bank may be accessed simultaneously.

For example, as seen in Fig. 11, when ADDR A
addresses a line of bank 0 and ADDR B addresses the
same line of bank 3, then multiplexer 474 causes
ADDR A to be provided to bank 0 and multiplexer 477
causes ADDR B to be provided to bank 3. The data
word which is addressed by ADDR A is provided to
load/store functional unit 134 as DATA A via the DATA
A data path and the data word which is addressed by
ADDR B is provided to load/store functional unit 134
as DATA B via the DATA B data path.

As seen in Fig. 12, when ADDR A and ADDR B
both address the same line of bank 0, then only this
line and bank is accessed and the data at this location
is provided to load/store functional unit 134 as both
DATA A and DATA B via the DATA A and DATA B data
paths, respectively.

When the two accesses are to the same bank but
different lines, then the port B access is stalled for
one cycle by data cache controller 190. Because data
cache accesses are generally random, as compared
to instruction cache accesses which have strong locality,
the frequency of port A and port B accesses to the
same bank but different lines is relatively low.
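
The three dual-access cases described above (different banks, same bank and same line, same bank but different lines) can be summarised in a small decision function. This is only a sketch: the function name and the returned labels are hypothetical, and the bit positions used for the line index are an assumption consistent with sixteen byte lines and bank select bits 2 and 3.

```python
def arbitrate(addr_a: int, addr_b: int) -> str:
    """Classify a simultaneous port A / port B access to one banked store array.

    Returns:
      "parallel" - different banks, both accesses proceed in the same cycle
      "shared"   - same bank and same line, one access feeds both data paths
      "stall_b"  - same bank but different lines, port B waits one cycle
    """
    bank_a, bank_b = (addr_a >> 2) & 0x3, (addr_b >> 2) & 0x3
    # Assumed 7-bit line index in bits 10..4 of the address.
    line_a, line_b = (addr_a >> 4) & 0x7F, (addr_b >> 4) & 0x7F
    if bank_a != bank_b:
        return "parallel"
    if line_a == line_b:
        return "shared"
    return "stall_b"


case_fig_11 = arbitrate(0x000, 0x00C)  # bank 0 and bank 3 of the same line
case_fig_12 = arbitrate(0x010, 0x012)  # same line, same bank
case_stall = arbitrate(0x010, 0x020)   # bank 0 of two different lines
```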

Store accesses to data cache 150 are via IAD bus
102. During a store, multiplexers 474 - 477 use the
store access to control which of banks 470 - 473 is
written with the 32-bit store double word. During a reload,
banks 470 - 473 are written in one 128-bit line
after the reload data has been accumulated in store
register 460.
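
The accumulation of a reload line in two 64-bit accesses can be sketched as follows. The class name `ReloadRegister` and its method are hypothetical, and the ordering assumed here (first access holds the low 64 bits, bank 0 holds the lowest 32 bits) is an assumption of the sketch that the description does not specify.

```python
from typing import List, Optional


class ReloadRegister:
    """Collects two 64-bit transfers and releases one 128-bit line as four bank words."""

    def __init__(self) -> None:
        self._halves: List[int] = []

    def accept(self, half: int) -> Optional[List[int]]:
        """Take one 64-bit transfer; return four 32-bit bank words once the line is full."""
        self._halves.append(half & 0xFFFFFFFFFFFFFFFF)
        if len(self._halves) < 2:
            return None  # still waiting for the second 64-bit access
        # Assumed ordering: first transfer is the low half of the line.
        line = self._halves[0] | (self._halves[1] << 64)
        self._halves = []
        # All four banks are written together, one 32-bit word per bank.
        return [(line >> (32 * bank)) & 0xFFFFFFFF for bank in range(4)]


register = ReloadRegister()
first = register.accept(0x1111111100000000)   # nothing is written to the banks yet
banks = register.accept(0x3333333322222222)   # full line available for banks 0 to 3
```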

Referring to Figs. 2 and 9-11, the general operation
of data cache 150 is discussed. When a data value
that is not stored in cache 150 is requested by
load/store functional unit 134, then a cache miss results.
Upon detecting a cache miss, the requested
value is written into an entry of data cache 150. More
specifically, load store section 134 translates the logical
address for the value to a linear address. This linear
address is provided to memory management unit
164. The linear address of the value is checked
against the linear tag portion of the memory management
unit's TLB array by a TLB compare circuit to determine
whether there is a TLB hit.

If load store functional unit 134 determines that
there was a TLB hit, then load store functional unit
134 examines the data to determine whether the data
is cacheable. If the data is cacheable and there is a
TLB hit, then the physical tag of the corresponding
physical address is written into a corresponding entry
of physical tag circuit 162. The data linear tag array
450 - 453 which corresponds to the array column in
which the data was stored is written with the linear tag
from the TLB array.

If there is not a TLB hit then the TLB array is up- 
dated by memory management unit 164 to include the 
address of the requested value so that a TLB hit re- 
sults. Then the physical tag is written to physical tag 
circuit 162 and the linear tag is written to the appro- 
priate linear tag array 450 - 453. 

A pre-fetch request is then made by load/store 
functional unit 134 to the external memory and the 
value which is stored in the external memory at the 
physical address which corresponds to the linear ad- 
dress is retrieved from the external memory. This val- 
ue is stored in the bank, line and column of store array 
200 which corresponds to the line and column loca- 
tions of the value's linear tags which are stored in the 
linear tag arrays. The corresponding linear tag valid
bit and valid physical translation bit in the linear tag
array 310 are set to indicate that the entry corresponding
to the linear tag is valid, that the linear tag
is valid and that the entry provides a successful physical
translation.

When the linear address for this value is again re- 
quested by load/store functional unit 134, load store 
section 134 converts the logical address to the linear 
address which provides a match of the linear tags in 
linear address tag array 310 with the requested ad- 
dress. Because the valid bit is set and the valid phys- 
ical translation bit is set, a linear address hit occurs, 
and the entry which is stored in the corresponding 
line of data store array 304 is provided to load/store 
functional unit 134. During the access by load store 
section 134, there is no need to access either the 
physical address tag circuit 162 or TLB circuit 164 
since the valid physical translation bit is set indicating 
that the entry has a valid physical translation. 
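
The linear address hit check described above can be sketched as a lookup across the four linear tag arrays. The names `LinearTagEntry` and `lookup` are hypothetical, the 7-bit line index and 32-bit address are assumptions of the sketch, and returning `None` simply stands for every case in which the fast linear hit does not apply (in which case the physical tags and TLB would have to be consulted).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinearTagEntry:
    """One linear tag array entry: DTAG value, TV bit and P bit."""
    dtag: int = 0
    tv: bool = False  # linear tag valid
    p: bool = False   # valid physical translation


def lookup(columns: List[List[LinearTagEntry]], address: int) -> Optional[int]:
    """Return the column that hits on the linear address, or None.

    A hit requires a matching tag with both the TV and P bits set, so no
    access to the physical tags or the TLB is needed in that case.
    """
    line = (address >> 4) & 0x7F          # assumed 7-bit line index
    tag = (address & 0xFFFFFFFF) >> 11    # upper 21 bits of the linear address
    for column, tag_array in enumerate(columns):
        entry = tag_array[line]
        if entry.tv and entry.p and entry.dtag == tag:
            return column
    return None


columns = [[LinearTagEntry() for _ in range(128)] for _ in range(4)]
address = 0x12345678
columns[2][(address >> 4) & 0x7F] = LinearTagEntry(dtag=address >> 11, tv=True, p=True)
hit_column = lookup(columns, address)
```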

Referring to Figs. 1-10 and Fig. 13, when a load
operation is being performed by load/store functional
unit 134 via port A and the data value to be loaded is
available in data cache 150, then a data cache hit results.
More specifically, during Φ1 of cycle 1, the
cache index is calculated by adder 240 or
RS0 adder 216; the cache index is the least significant
11 bits of the linear address and is calculated as
part of the linear address compute. This cache index
is used to access the appropriate line
and bank of data cache store array 200. When the
appropriate line and bank is accessed, the linear address,
which is calculated by adder 242, is used to access
the appropriate column of store array 200 by
comparing the linear tags. The data value is then returned
to driver circuit 220 of reservation station circuit
124 via the DATA A data path. This data value is
formatted by driver circuit 220 for providing to
result bus 0. During Φ2 of cycle 1, limit check circuit 252
performs a segment limit check and a protection
check, as is well known in the art, on the linear address.
During Φ1 of cycle 2, the data value and corresponding
destination tags are driven onto result bus
0 for port A.

While a load operation is performed via port A, a
corresponding load operation may be performed via
port B. This corresponding load operation uses reservation
station RS1 along with its corresponding adder
to perform the address generation of the data cache
access. The data value and corresponding destination
tags for the entry in reservation station RS1 are
driven onto result bus 1.

Referring to Figs. 1-10 and Fig. 14, when a store
operation is being performed by load/store functional
unit 134 via port A and the data value to be stored is
already stored in data cache 150, then a data cache
hit results. Because stores are performed as read
modify write operations, the first portion of a store
operation is similar to a load operation. After the data
value is loaded, the loaded value is written to
store buffer circuit 180 in order to modify the loaded
data value.

More specifically, during Φ1 of cycle 1, the cache
index is calculated by adder 240 or RS0
adder 216; the cache index is the least significant 11
bits of the linear address and is calculated as part of
the linear address compute. This cache index
is used to access the appropriate line and
bank of data cache store array 200. When the appropriate
line and bank is accessed, the linear address,
which is calculated by adder 242, is used to access
the appropriate column of store array 200 by comparing
the linear tags. The data value is then returned to
driver circuit 220 of reservation station circuit 124 via
the DATA A data path. This data value is formatted by
driver circuit 220 for providing to result bus 0. During
Φ2 of cycle 1, limit check circuit 252 performs a
segment limit check and a protection check, as is well
known in the art, on the linear address. During Φ1 of
cycle 2, the data value and corresponding destination
tags are driven onto result bus 0 for port A and are also
stored in the next available entry of store buffer circuit
180. This value is held in store buffer circuit 180 until
the store operation is retired from reorder buffer 114,
which occurs when there are no other instructions
pending. Reorder buffer 114 then indicates to
load/store controller 182, using the load store retire
signal, that the store instruction may be retired, i.e.,
that the store may be performed. Because stores
actually modify the state of the data value, stores are
not speculatively performed and must wait until it is
clear that the store is actually the next instruction
before reorder buffer 114 allows the store to be executed.
After reorder buffer 114 has indicated that the
instruction may be performed, the data value and
corresponding linear address are driven to IAD bus 102
during Φ1 of the cycle following the release of the
instruction. During Φ2 of this cycle, the data value is
written to the appropriate line and bank of data cache
store array 200. Additionally, if physical tag circuit 162
indicates that the value should also be written externally,
then the data value is written to external memory
at the physical address location which corresponds to
the linear address. The physical address translation
is performed by memory management unit 164,
which also receives the linear address from IAD bus
102.

Referring to Figs. 1-10 and 15, when a speculative
load operation is being performed by load/store
functional unit 134 and the data value to be loaded is
not available in data cache 150, then a speculative
data cache miss results. The first cycle of the load
operation is the same as if a cache hit had resulted.

When cache 150 is accessed and a cache miss
results, then during cycle 2, the TLB is accessed in
memory management unit 164 and the physical tags
are accessed in physical tag circuit 162 to determine
the physical address of the data value. This physical
address is then checked within memory management
unit 164 to confirm that the physical address does not
violate any protection checks. During the next cycle,
if the port B access is not to the same bank of cache
array 200, then port B initiates another cache access.
Additionally, during Φ2 of this cycle, cache array 200
is updated with the tag valid bits of the line from the
tag buses. During the next cycle, the data value,
destination tag and status are driven onto the next
available result bus and normal operation, which assumes
cache hits, resumes.

Referring to Figs. 1-10 and 16, during a cache reload,
the first cycle of the reload operation is the same
as if a cache hit had resulted. However, after cache
controller 190 determines that a cache miss resulted,
load/store functional unit 134 waits for store buffer
circuit 180 to empty before accessing external memory
to reload cache 150. After waiting several clock
cycles, physical tag circuit 162 provides a data available
signal (L22LS) which indicates to cache 150 that all
128 bits of data have been written to store register 460.
After the data is available and written into data cache
array 200, the data, destination tag and status
information are driven onto result bus 0 by driver circuit
220 of reservation station circuit 124.

Referring to Fig. 17, for a misaligned access there
are two accesses during subsequent cycles. Each of the
two accesses is the same as a cache hit access. The
data which is returned from each access is accumulated
by driver circuit 220. After the two accesses are
complete, and the data has been accumulated, the
data is formatted, as discussed above, by driver circuit
220. The data, destination tag and status information
are then driven onto result bus 0 by driver circuit
220 of reservation station circuit 124. Misaligned
accesses are only performed using reservation station
0. Accordingly, only the RS0 adder and the port A portion
of driver circuit 220 require the circuitry necessary
for performing misaligned accesses.
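
The two-access handling of a misaligned load can be sketched as two aligned 32-bit reads whose results are accumulated and then formatted into one value. The function name and the `read_word` callback are hypothetical, and little-endian byte order is an assumption of this sketch that the description does not state.

```python
from typing import Callable


def misaligned_load(read_word: Callable[[int], int], address: int) -> int:
    """Load a 32-bit value from any byte address using aligned 32-bit reads.

    An aligned address needs one access. A misaligned address needs two
    accesses on subsequent cycles; the two words are accumulated and the
    wanted four bytes are extracted (little-endian order is assumed).
    """
    base = address & ~0x3            # aligned address of the first access
    shift = (address & 0x3) * 8      # bit offset of the wanted data
    low = read_word(base)
    if shift == 0:
        return low
    high = read_word(base + 4)       # second access, next aligned word
    combined = (high << 32) | low    # accumulate both results
    return (combined >> shift) & 0xFFFFFFFF


memory = bytes(range(16))


def read_word(aligned_address: int) -> int:
    """Toy aligned memory read used only for this example."""
    return int.from_bytes(memory[aligned_address:aligned_address + 4], "little")


value = misaligned_load(read_word, 1)  # bytes 1, 2, 3 and 4 of the toy memory
```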

Other Embodiments 

Other embodiments are within the following 
claims. 

For example, load/store functional unit 134 may
be divided into two separate functional units, a load
functional unit and a store functional unit. In this
embodiment, the operation of the functional units would
be substantially the same as described; however,
each functional unit would include a respective
reservation station, i.e., the load section includes a load
reservation station which functions as discussed with
reference to loads and the store section includes a
store reservation station which functions as discussed
with reference to stores.



Claims 

1. A load functional unit for performing a plurality of
load operations in parallel comprising:

a load functional unit, the load functional
unit including

a reservation station circuit for temporarily
holding load operations, the reservation
station circuit including

a first reservation station entry
and a second reservation station entry, the
second reservation station entry being coupled
to the first reservation station entry and providing
a reservation station entry output to the first
reservation station entry, and

an input signal merge circuit,
the input signal multiplexer circuit receiving load
signals in parallel and providing a first load signal
to the first and second reservation station entries
and a second load signal to the first and second
reservation station entries, and

a load control circuit for controlling
which of the first and second input load
signals the first and second reservation station
entries retrieve, and

a data cache including a store array and a
data cache controller,

the store array being coupled to the
first and second reservation station entries of the
load functional unit via respective first and second
data cache ports, the store array providing
data in parallel to the load functional unit in
response to the first and second load signals,

the data cache controller being
coupled to the load control circuit.

2. The load functional unit of claim 1 wherein the 
reservation station circuit further includes 

a reservation station driver circuit, the res- 
ervation station driver circuit receiving the data 
cache data in parallel and providing the data 
cache data to first and second result buses in par- 
allel. 

3. The load functional unit of claim 1 wherein the 
load control circuit controls which load signal is 
retrieved by which of the first and second reser- 
vation entries in response to a type code match 
signal, 

the type code match signal being generat- 
ed by the load control circuit when a type code 
from a type code bus matches a predetermined 
load functional unit type code. 

4. The load functional unit of claim 1 wherein

the reservation station circuit further includes
a third reservation station entry, the third
reservation station entry being coupled to the
second reservation station entry and providing a
third reservation station entry output to the second
reservation station entry, the third reservation
station entry being coupled to the first reservation
station entry and providing the third reservation
station entry output to the first reservation
station entry,

one of the first and second
reservation station entries retrieving
the third reservation station entry output under
control of the load control circuit.

5. The load functional unit of claim 4 wherein

the reservation station circuit further includes
a fourth reservation station entry, the
fourth reservation station entry being coupled to
the third reservation station entry and providing
a fourth reservation station entry output to the
third reservation station entry, the fourth reservation
station entry being coupled to the second
reservation station entry and providing the fourth
reservation station entry output to the second
reservation station entry,

one of the third and second reservation
station entries retrieving the fourth reservation
station entry output under control of the
load control circuit.

6. The load functional unit of claim 1 wherein the
reservation station circuit further includes

first and second adder circuits coupled to
the first and second reservation station entries,
respectively,

the first and second adder circuits
receiving the load signals and providing cache
address signals based upon the load signals, the
cache address signals accessing respective first
and second locations within the data cache store
array.

7. The load functional unit of claim 6 wherein the 
first and second adder circuits each include 

a logical address adder for receiving a 
plurality of address component signals and pro- 
viding a logical address signal, and 

a linear address adder for receiving the 
logical address signal and a segment base signal 
and providing a linear address. 

8. The load functional unit of claim 7 wherein the ad- 
dress component signal includes 

an A operand adder signal, a B operand 
adder signal and a displacement adder signal. 

9. The load functional unit of claim 8 wherein the
first adder circuit further includes

an A operand multiplexer circuit for receiving
an A operand signal and a zero signal and providing
one of these values as the A operand adder
signal in response to address mode control information
from the load controller,

a B operand multiplexer circuit for receiving
a B operand signal and a misaligned address
one signal and providing one of these signals as
the B operand adder signal in response to address
mode control information from the load
controller, and

a displacement multiplexer circuit for receiving
a displacement signal, a four signal and
a five signal and providing one of these values as
the displacement adder signal in response to address
mode control information from the load
controller.

10. The load functional unit of claim 8 wherein the
second adder circuit further includes

an A operand multiplexer circuit for receiving
an A operand signal and a zero signal and providing
one of these values as the A operand adder
signal in response to address mode control information
from the load controller, and

a B operand multiplexer circuit for receiving
a B operand signal and a misaligned address
one signal and providing one of these signals as
the B operand adder signal in response to address
mode control information from the load
controller and where

a displacement signal is provided directly
to the logical address adder.

11. A store functional unit for performing a store
forwarding operation comprising:

first and second store buffer entry circuits
for holding store operations, the second store
buffer entry being coupled to the first store buffer
entry and providing a second store buffer entry
output to the first store buffer entry, the first store
buffer entry being coupled to the second store
buffer entry and providing a first store buffer entry
output to the second store buffer entry; and

a store controller for controlling whether
the second store buffer entry circuit retrieves the
first store buffer entry output to perform a store
forwarding operation with the first store buffer
entry output;

the store controller being coupled
to the first and second store buffer entry circuits.

12. The store functional unit of claim 11 further
comprising

a third store buffer entry circuit, the third
store buffer entry circuit being coupled to the second
store buffer entry circuit and providing a third
store buffer entry output to the second store buffer
entry, the first store buffer entry circuit being
coupled to the third store buffer entry circuit and
providing a first store buffer entry output to the
third store buffer entry circuit, and the second
store buffer entry circuit being coupled to the
third store buffer entry circuit and providing a
second store buffer entry output to the third store
buffer entry circuit;

and wherein

the store controller is coupled to the third
store buffer entry circuit and controls whether the
third store buffer entry circuit retrieves either the
first or second store buffer entry outputs to perform
a store forwarding operation with the first or
second store buffer entry outputs.

13. The store functional unit of claim 12 further
comprising

a fourth store buffer entry circuit, the
fourth store buffer entry circuit being coupled to
the third store buffer entry circuit and providing a
fourth store buffer entry output to the third store
buffer entry, the first store buffer entry circuit
being coupled to the fourth store buffer entry circuit
and providing a first store buffer entry output to
the fourth store buffer entry circuit, the second
store buffer entry circuit being coupled to the
fourth store buffer entry circuit and providing a
second store buffer entry output to the fourth
store buffer entry circuit, and the third store
buffer entry circuit being coupled to the fourth store
buffer entry circuit and providing a third store
buffer entry output to the fourth store buffer entry
circuit;

and wherein

the store controller is coupled to the fourth
store buffer entry circuit and controls whether the
fourth store buffer entry circuit retrieves either
the first or second store buffer entry outputs to
perform a store forwarding operation with the
first or second store buffer entry outputs.

14. The store functional unit of claim 11 wherein each
of the first and second store buffer entry circuits
includes

a store buffer register circuit for holding a
store buffer entry, and

a store buffer multiplexer circuit for
controlling which signals are provided to the store
buffer register circuit for holding.

15. The store functional unit of claim 14 wherein the
store buffer register circuit includes

a store buffer entry data register for holding
a store buffer data entry of the store buffer
entry;

a store buffer entry address register for
holding a store buffer address entry of the store
buffer entry; and

a store buffer entry tag portion for holding
a store buffer tag entry of the store buffer entry.

16. The store functional unit of claim 15 wherein the
store buffer multiplexer circuit includes

a data byte multiplexer circuit for receiving
a plurality of data signals and providing one of the
plurality of data signals as the store buffer data
entry under control of the store controller;

an address byte multiplexer circuit for
receiving a plurality of address signals and providing
one of the plurality of address signals as the
store buffer address entry under control of the
store controller; and

a tag multiplexer circuit for receiving a
plurality of tag signals and providing at least one of
the plurality of tag signals as the store buffer tag
entry under control of the store controller.
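
By way of illustration only, the arrangement of claims 11 to 16 can be approximated by a toy software model in which each store buffer entry is a register fed by a multiplexer and the store controller chooses the multiplexer source each cycle. The sketch below is an assumption-laden simplification: the class and field names are hypothetical, the `"new"` source standing for a store arriving from outside the buffer is invented for the example, and selection is modelled per whole entry rather than with the separate data byte, address byte, and tag multiplexers recited in claim 16.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class StoreEntry:
    """Contents of one store buffer register: data, address, tag."""
    data: int
    address: int
    tag: int


# A multiplexer source is either the index of another entry or the
# hypothetical string "new" for a store arriving from outside the buffer.
Source = Union[int, str]


class StoreBuffer:
    """Toy model: each entry is a register fed by a multiplexer."""

    def __init__(self, n_entries: int = 2) -> None:
        self.registers: List[Optional[StoreEntry]] = [None] * n_entries

    def clock(self, selects: Dict[int, Source],
              new_store: Optional[StoreEntry] = None) -> None:
        """Advance one cycle.

        selects maps an entry index to the source its multiplexer passes;
        entries that are not listed hold their current contents.
        """
        # Sample every output before updating, so each entry sees the
        # previous cycle's values, as registers on a shared clock would.
        outputs = list(self.registers)
        for index, source in selects.items():
            if source == "new":
                self.registers[index] = new_store
            elif source == index:
                raise ValueError("an entry cannot select its own output")
            else:
                self.registers[index] = outputs[source]


buf = StoreBuffer(2)
buf.clock({0: "new"}, StoreEntry(data=0xAB, address=0x100, tag=1))
# Entry 1 retrieves entry 0's output while entry 0 accepts a new store.
buf.clock({1: 0, 0: "new"}, StoreEntry(data=0xCD, address=0x104, tag=2))
```

After the second cycle the older store sits in entry 1 and the newer store in entry 0. Sampling the outputs before any register is written is what lets both transfers happen in the same cycle.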

17. A load/store functional unit of a microprocessor,
the load/store functional unit performing load
operations and store operations to a cache in
parallel comprising:

a reservation station circuit for temporarily
holding load operations and store operations, the
reservation station circuit including first and
second reservation station entries, the first and
second reservation station entries being coupled to
first and second ports of a data cache;

a store buffer circuit for temporarily holding
store operations, the store buffer circuit
including first and second store buffer entries for
temporarily holding store operations, at least one
of the store buffer entries being coupled to at
least one of the reservation station entries; and

a control circuit for controlling the reservation
station entries and the store buffer entries,
the control circuit being coupled to the reservation
station circuit, to the store buffer circuit and
to the data cache.

18. An apparatus for processing information,
comprising:

an external memory for holding the
information; and

a processor coupled with the main memory
via a processor bus, the processor including

a cache for temporarily storing the
information, the cache being coupled to the
external memory; and

a load/store functional unit for performing
load operations and store operations, the
load/store functional unit including

a reservation station circuit
for temporarily holding load operations and store
operations, the reservation station circuit including
first and second reservation station entries,
the first and second reservation station entries
being coupled to first and second ports of the
data cache;

a store buffer circuit for temporarily
holding store operations, the store buffer
circuit including first and second store buffer
entries for temporarily holding store operations, at
least one of the store buffer entries being coupled
to at least one of the reservation station entries;
and

a control circuit for controlling
the reservation station entries and the store
buffer entries, the control circuit being coupled to
the reservation station circuit, to the store buffer
circuit and to the data cache.
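
By way of illustration only, the coupling recited in claims 17 and 18 might be pictured with the following toy model. It is a sketch under stated assumptions and not the claimed apparatus. The claims recite only which circuits are coupled to which, so the behaviour here is assumed: a load reads through its entry's port, a store moves from its reservation station entry into a first in first out store buffer, a load checks buffered stores before the cache, and buffered stores are written to the cache one at a time. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MemOp:
    kind: str       # "load" or "store"
    address: int
    data: int = 0   # used by stores only


class LoadStoreUnit:
    """Two reservation station entries, each with its own cache port."""

    def __init__(self, cache: Dict[int, int]) -> None:
        self.cache = cache  # a dict stands in for the data cache
        self.reservation: List[Optional[MemOp]] = [None, None]
        self.store_buffer: List[MemOp] = []  # FIFO, oldest first

    def issue(self, op: MemOp) -> bool:
        """Place op in a free reservation station entry, if there is one."""
        for i, slot in enumerate(self.reservation):
            if slot is None:
                self.reservation[i] = op
                return True
        return False

    def cycle(self) -> List[Optional[int]]:
        """Service both entries; return the load result seen on each port."""
        results: List[Optional[int]] = [None, None]
        for port, op in enumerate(self.reservation):
            if op is None:
                continue
            if op.kind == "load":
                results[port] = self._read(op.address)
            else:
                self.store_buffer.append(op)  # assumed: stores wait here
            self.reservation[port] = None
        return results

    def _read(self, address: int) -> int:
        # Assumed: the newest buffered store to this address wins.
        for store in reversed(self.store_buffer):
            if store.address == address:
                return store.data
        return self.cache.get(address, 0)

    def retire_store(self) -> None:
        """Write the oldest buffered store to the cache."""
        if self.store_buffer:
            store = self.store_buffer.pop(0)
            self.cache[store.address] = store.data
```

A load and a store issued together occupy the two entries and are serviced in the same call to `cycle()`, which is the sense in which the model accesses the cache in parallel.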